Abstract

We develop a family of accelerated stochastic algorithms that minimize sums of convex 
functions. Our algorithms improve upon the fastest running time for empirical risk minimization 
(ERM), and in particular linear least-squares regression, across a wide range of problem settings. 

To achieve this, we establish a framework based on the classical proximal point algorithm. 
Namely, we provide several algorithms that reduce the minimization of a strongly convex function
to approximate minimizations of regularizations of the function. Using these results, we
accelerate recent fast stochastic algorithms in a black-box fashion. 

Empirically, we demonstrate that the resulting algorithms exhibit notions of stability that 
are advantageous in practice. Both in theory and in practice, the provided algorithms reap the 
computational benefits of adding a large strongly convex regularization term, without incurring 
a corresponding bias to the original problem. 


1 Introduction 


A general optimization problem central to machine learning is that of empirical risk minimization 
(ERM): finding a predictor or regressor that minimizes a sum of loss functions defined by a data 
sample. We focus in part on the problem of empirical risk minimization of linear predictors: given 
a set of $n$ data points $a_1, \ldots, a_n \in \mathbb{R}^d$ and convex loss functions $\phi_i : \mathbb{R} \to \mathbb{R}$ for $i = 1, \ldots, n$, solve

$$\min_{x \in \mathbb{R}^d} F(x), \quad \text{where} \quad F(x) = \sum_{i=1}^{n} \phi_i(a_i^\top x). \tag{1}$$

This problem underlies supervised learning (e.g. the training of logistic regressors when $\phi_i(z) = \log(1 + e^{-z b_i})$,
or their regularized form when $\phi_i(z) = \log(1 + e^{-z b_i}) + \frac{\gamma}{2n}\|x\|_2^2$ for a scalar $\gamma > 0$)
and captures the widely-studied problem of linear least-squares regression when $\phi_i(z) = \frac{1}{2}(z - b_i)^2$.
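As a concrete illustration of objective (1), the following minimal sketch evaluates $F$ for the logistic and least-squares losses on a tiny data set. The data `A`, the targets `b`, and all helper names are hypothetical and serve only the example; labels are assumed to lie in {-1, +1} for the logistic loss.

```python
import math

def erm_objective(loss, A, b, x):
    """F(x) = sum_i loss(a_i^T x, b_i), with a_i the i-th row of A."""
    total = 0.0
    for a_i, b_i in zip(A, b):
        z = sum(a_ij * x_j for a_ij, x_j in zip(a_i, x))  # inner product a_i^T x
        total += loss(z, b_i)
    return total

def logistic_loss(z, b_i):
    # phi_i(z) = log(1 + exp(-z * b_i)), assuming labels b_i in {-1, +1}
    return math.log1p(math.exp(-z * b_i))

def squared_loss(z, b_i):
    # phi_i(z) = (z - b_i)^2 / 2
    return 0.5 * (z - b_i) ** 2

# Hypothetical toy data: n = 3 points in d = 2 dimensions.
A = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
b = [1.0, -1.0, 1.0]
x = [0.0, 0.0]

F_logistic = erm_objective(logistic_loss, A, b, x)  # every term is log(2) at x = 0
F_squares = erm_objective(squared_loss, A, b, x)    # every term is 1/2 at x = 0
```

At $x = 0$ every inner product vanishes, so the two values reduce to $3\log 2$ and $3/2$, which makes the sketch easy to check by hand.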

Over the past five years, problems such as (1) have received increased attention, with a recent 
burst of activity in the design of fast randomized algorithms. Iterative methods that randomly 
sample the $\phi_i$ have been shown to outperform standard first-order methods under mild assumptions
(Bottou and Bousquet, 2008; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014;
Shalev-Shwartz and Zhang, 2014).

Despite the breadth of these recent results, their running time guarantees when solving the 
ERM problem (1) are sub-optimal in terms of their dependence on a natural notion of the problem’s
condition number (see Section 1.1). This dependence can, however, significantly impact their
guarantees on running time. High-dimensional problems encountered in practice are often poorly
conditioned. In large-scale machine learning applications, the condition number of the ERM problem
(1) captures notions of data complexity arising from variable correlation in high dimensions
and is hence prone to be very large.

More specifically, among the recent randomized algorithms, each one either: 

1. Solves the ERM problem (1), under an assumption of strong convexity, with convergence 
that depends linearly on the problem’s condition number (Johnson and Zhang, 2013; Defazio
et al., 2014).

2. Solves only an explicitly regularized ERM problem, $\min_x \{F(x) + \lambda r(x)\}$, where the regularizer
$r$ is a known 1-strongly convex function and $\lambda$ must be strictly positive, even when $F$ is itself
strongly convex. One such result is due to Shalev-Shwartz and Zhang (2014) and is the
first to achieve acceleration for this problem, i.e. dependence only on the square root of the
regularized problem’s condition number, which scales inversely with $\lambda$. Hence, taking small
$\lambda$ to solve the ERM problem (where $\lambda = 0$ in effect) is not a viable option.

In this paper we show how to bridge this gap via black-box reductions. Namely, we develop
algorithms to solve the ERM problem (1) - under a standard assumption of strong convexity -
through repeated, approximate minimizations of the regularized ERM problem $\min_x \{F(x) + \lambda r(x)\}$
for fairly large $\lambda$. Instantiating our framework with known randomized algorithms that solve the
regularized ERM problem, we achieve accelerated running time guarantees for solving the original
ERM problem.

The key to our reductions is a set of approximate variants of the classical proximal point algorithm
(PPA) (Rockafellar, 1976; Parikh and Boyd, 2014). We show how both PPA and the inner minimization
procedure can then be accelerated, and our analysis gives precise approximation requirements
for either option. Furthermore, we show further practical improvements when the inner minimizer
operates by a dual ascent method. In total, this provides at least three different algorithms for
achieving an improved accelerated running time for solving the ERM problem (1) under the standard
assumption of strongly convex $F$ and smooth $\phi_i$. (Table 1 summarizes our improvements in
comparison to existing minimization procedures.)

Perhaps the strongest and most general theoretical reduction we provide in this paper is encompassed
by the following theorem, which we prove in Section 3.


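Theorem 1.1 (Accelerated Approximate Proximal Point Algorithm). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a
$\mu$-strongly convex function and suppose that, for all $x_0 \in \mathbb{R}^n$, $c > 0$, $\lambda > 0$, we can compute a point
$x_c$ (possibly random) such that

$$\mathbb{E} f(x_c) - \min_x \left\{ f(x) + \frac{\lambda}{2}\|x - x_0\|_2^2 \right\} \le \frac{1}{c}\left[ f(x_0) - \min_x \left\{ f(x) + \frac{\lambda}{2}\|x - x_0\|_2^2 \right\} \right]$$

in time $T_{\lambda,c}$. Then given any $x_0$, $c > 0$, $\lambda > 2\mu$, we can compute $x_1$ such that

$$\mathbb{E} f(x_1) - \min_x f(x) \le \frac{1}{c}\left[ f(x_0) - \min_x f(x) \right]$$

in time $O\left( T_{\lambda,\, 4\left(\frac{2\lambda+\mu}{\mu}\right)^{3/2}} \sqrt{\lceil \lambda/\mu \rceil}\, \log c \right)$.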

This theorem essentially states that we can use a linearly convergent algorithm for minimizing
$f(x) + \lambda\|x - x_0\|_2^2$ in order to minimize $f$, while incurring a multiplicative overhead of only
$O(\sqrt{\lceil \lambda/\mu \rceil}\, \mathrm{polylog}(\lambda/\mu))$. Applying this theorem to previous state-of-the-art algorithms improves
the running time both for solving (1) and for the following more general ERM problem:

$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \psi_i(x), \quad \text{where } \psi_i : \mathbb{R}^d \to \mathbb{R}. \tag{2}$$

Problem (2) is fundamental in the theory of convex optimization and covers ERM problems for 
multiclass and structured prediction. 

There are a variety of additional extensions to the ERM problem to which some of our analysis
easily applies. For instance, we could work in more general normed spaces, allow non-uniform
smoothness of the $\phi_i$, add an explicit regularizer, etc. However, to simplify exposition and comparison
to related work, we focus on (1) and make clear the extensions to (2) in Section 3. These cases
capture the core of the arguments presented and illustrate the generality of this approach.

Several of the algorithmic tools and analysis techniques in this paper are similar in principle to
(and sometimes appear indirectly in) work scattered throughout the machine learning and optimization
literature - from classical treatments of error-tolerant PPA (Rockafellar, 1976; Guler, 1992) to
the effective proximal term used by Accelerated Proximal SDCA (Shalev-Shwartz and Zhang, 2014)
in enabling its acceleration.

By analyzing these as separate tools, and by bookkeeping the error requirements that they 
impose, we are able to assemble them into algorithms with improved guarantees. We believe that 
the presentation of Accelerated APPA (Algorithm 2) arising from this view simplifies, and clarifies 
in terms of broader convex optimization theory, the “outer loop” steps employed by Accelerated 
Proximal SDCA. More generally, we hope that disentangling the relevant algorithmic components 
into this general reduction framework will lead to further applications both in theory and in practice. 


1.1 Formal setup 

We consider the ERM problem (1) in the following common setting: 

Assumption 1.2 (Regularity). Each loss function $\phi_i$ is $L$-smooth, i.e. for all $x, y \in \mathbb{R}$,

$$\phi_i(y) \le \phi_i(x) + \phi_i'(x)(y - x) + \frac{L}{2}(y - x)^2,$$

and the sum $F$ is $\mu$-strongly convex, i.e. for all $x, y \in \mathbb{R}^d$,

$$F(y) \ge F(x) + \nabla F(x)^\top (y - x) + \frac{\mu}{2}\|y - x\|_2^2.$$

We let $R \stackrel{\mathrm{def}}{=} \max_i \|a_i\|_2$ and let $A \in \mathbb{R}^{n \times d}$ be the matrix whose $i$'th row is $a_i^\top$. We refer to

$$\kappa \stackrel{\mathrm{def}}{=} \lceil LR^2/\mu \rceil$$

as the condition number of (1).
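The condition number can be computed directly from the data in simple cases. The sketch below does so for least-squares losses, assuming $L = 1$ (each $\phi_i(z) = \frac{1}{2}(z - b_i)^2$ has unit curvature) and taking $\mu$ to be the smallest eigenvalue of $A^\top A$, which is the strong convexity parameter of $F(x) = \frac{1}{2}\|Ax - b\|_2^2$. The matrix `A` and the helper names are hypothetical, and the eigenvalue helper only handles $d = 2$.

```python
import math

def condition_number(A, L, mu):
    """kappa = ceil(L * R^2 / mu), with R the largest Euclidean row norm of A."""
    R_squared = max(sum(a_ij ** 2 for a_ij in a_i) for a_i in A)
    return math.ceil(L * R_squared / mu)

def min_eig_2x2_gram(A):
    """Smallest eigenvalue of A^T A when d = 2 (closed form for a symmetric 2x2 matrix)."""
    p = sum(a[0] * a[0] for a in A)
    q = sum(a[0] * a[1] for a in A)
    r = sum(a[1] * a[1] for a in A)
    return 0.5 * ((p + r) - math.sqrt((p - r) ** 2 + 4.0 * q * q))

# Hypothetical data: A^T A = diag(9, 2), so mu = 2 and R^2 = 9.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
L = 1.0
mu = min_eig_2x2_gram(A)
kappa = condition_number(A, L, mu)  # ceil(9 / 2)
```

Here the single long row $a_1$ dominates $R$, while the weakly covered second coordinate fixes $\mu$; both effects push $\kappa$ up.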

Although many algorithms are designed for special cases of the ERM objective F where there is 
some known, exploitable structure to the problem, our aim is to study the most general case subject 
to Assumption 1.2. To standardize the comparison among algorithms, we consider the following 
generic model of interaction with $F$:

Assumption 1.3 (Computational model). For any $i \in [n]$ and $x \in \mathbb{R}^d$, we consider two primitive
operations:

• For $b \in \mathbb{R}$, compute the gradient of $x \mapsto \phi_i(a_i^\top x - b)$.

• For $b \in \mathbb{R}$, $c \in \mathbb{R}^d$, minimize $\phi_i(a_i^\top x) + b\|x - c\|_2^2$.

We refer to these operations, as well as to the evaluation of $\phi_i(a_i^\top x)$, as single accesses to $\phi_i$, and
assume that these operations can be performed in $O(d)$ time.

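Notation Denote $[n] \stackrel{\mathrm{def}}{=} \{1, \ldots, n\}$. Denote the optimal value of a convex function $f$ by $f^{opt} = \min_x f(x)$,
and, when $f$ is clear from context, let $x^{opt}$ denote a minimizer. A point $x'$ is an
$\epsilon$-approximate minimizer of $f$ if $f(x') - f^{opt} \le \epsilon$. The Fenchel dual of a convex function $f : \mathbb{R}^k \to \mathbb{R}$
is $f^* : \mathbb{R}^k \to \mathbb{R}$ defined by $f^*(y) = \sup_{x \in \mathbb{R}^k} \{\langle y, x \rangle - f(x)\}$. We use $\tilde{O}(\cdot)$ to hide factors
polylogarithmic in $n$, $L$, $\mu$, $\lambda$, and $R$, i.e. $\tilde{O}(f) = O(f\, \mathrm{polylog}(n, L, \mu, \lambda, R))$.

Regularization and duality Throughout the paper we let $F : \mathbb{R}^d \to \mathbb{R}$ denote a $\mu$-strongly
convex function. For certain results presented, $F$ must in particular be the ERM problem (1),
while other statements hold more generally. We make it clear on a case-by-case basis when $F$ must
have the ERM structure as in (1).

Beginning in Section 1.3 and throughout the remainder of the paper, we frequently consider the
function $f_{s,\lambda}(x)$, defined for all $x, s \in \mathbb{R}^d$ and $\lambda > 0$ by

$$f_{s,\lambda}(x) \stackrel{\mathrm{def}}{=} F(x) + \frac{\lambda}{2}\|x - s\|_2^2. \tag{3}$$

In such context, we let $x^{opt}_{s,\lambda} \stackrel{\mathrm{def}}{=} \arg\min_x f_{s,\lambda}(x)$ and we call

$$\kappa_\lambda \stackrel{\mathrm{def}}{=} \lceil LR^2/\lambda \rceil$$

the regularized condition number.

When $F$ is indeed the ERM objective (1), certain algorithms for minimizing $f_{s,\lambda}$ operate in the
regularized ERM dual. Namely, they proceed by decreasing the negative dual objective $g_{s,\lambda} : \mathbb{R}^n \to \mathbb{R}$,
given by

$$g_{s,\lambda}(y) \stackrel{\mathrm{def}}{=} G(y) + \frac{1}{2\lambda}\|A^\top y\|_2^2 - s^\top A^\top y, \tag{4}$$

where $G(y) \stackrel{\mathrm{def}}{=} \sum_{i=1}^{n} \phi_i^*(y_i)$. Similar to the above, we let $y^{opt}_{s,\lambda} \stackrel{\mathrm{def}}{=} \arg\min_y g_{s,\lambda}(y)$.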
Empirical risk minimization

Algorithm      Running time                                              Problem
GD             $dn^2 \kappa \log(\epsilon_0/\epsilon)$                   $F$
Accel. GD      $dn^{3/2} \sqrt{\kappa} \log(\epsilon_0/\epsilon)$        $F$
SAG, SVRG      $dn\kappa \log(\epsilon_0/\epsilon)$                      $F$
SDCA           $dn\kappa_\lambda \log(\epsilon_0/\epsilon)$              $F + \lambda r$
AP-SDCA        $dn\sqrt{\kappa_\lambda} \log(\epsilon_0/\epsilon)$       $F + \lambda r$
APCG           $dn\sqrt{\kappa_\lambda} \log(\epsilon_0/\epsilon)$       $F + \lambda r$
This work      $dn\sqrt{\kappa} \log(\epsilon_0/\epsilon)$               $F$

Linear least-squares regression

Algorithm      Running time                                              Problem
Naive mult.    $nd^2$                                                    $\|Ax - b\|_2^2$
Fast mult.     $nd^{\omega-1}$                                           $\|Ax - b\|_2^2$
Row sampling   $(nd + d^{\omega}) \log(\epsilon_0/\epsilon)$             $\|Ax - b\|_2^2$
OSNAP          $(nd + d^{\omega}) \log(\epsilon_0/\epsilon)$             $\|Ax - b\|_2^2$
R. Kaczmarz    $dn\kappa \log(\epsilon_0/\epsilon)$                      $Ax = b$
Acc. coord.    $dn\sqrt{\kappa} \log(\epsilon_0/\epsilon)$               $Ax = b$
This work      $dn\sqrt{\kappa} \log(\epsilon_0/\epsilon)$               $\|Ax - b\|_2^2$


Table 1: Theoretical performance comparison on ERM and linear regression. Running times hold
in expectation for randomized algorithms. In the “problem” column for ERM, $F$ marks algorithms
that can optimize the ERM objective (1), while $F + \lambda r$ marks those that only solve the explicitly
regularized problem. For linear regression, $Ax = b$ marks algorithms that only solve consistent
linear systems, whereas $\|Ax - b\|_2^2$ marks those that more generally minimize the squared loss. The
constant $\omega$ denotes the exponent of the matrix multiplication running time (currently below 2.373
(Williams, 2012)). See Section 1.2 for more detail on these algorithms and their running times.
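To get a feel for the gap between the unaccelerated and accelerated ERM rows of Table 1, the following sketch plugs hypothetical problem sizes into the leading terms $dn\kappa$ and $dn\sqrt{\kappa}$, ignoring the logarithmic factors. The values of `n`, `d`, and `kappa` are made up for illustration and the function names are not from the paper.

```python
import math

def leading_term_unaccelerated(n, d, kappa):
    """Leading term d * n * kappa (the SAG/SVRG row), log factors dropped."""
    return d * n * kappa

def leading_term_accelerated(n, d, kappa):
    """Leading term d * n * sqrt(kappa) (the 'This work' row), log factors dropped."""
    return d * n * math.sqrt(kappa)

n, d, kappa = 10**5, 100, 10**4  # hypothetical problem sizes
speedup = leading_term_unaccelerated(n, d, kappa) / leading_term_accelerated(n, d, kappa)
```

The ratio is simply $\sqrt{\kappa}$, so the benefit of acceleration grows with how poorly conditioned the problem is.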


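To make corresponding primal progress, dual-based algorithms make use of the dual-to-primal
mapping, given by

$$\hat{x}_{s,\lambda}(y) \stackrel{\mathrm{def}}{=} s - \frac{1}{\lambda} A^\top y, \tag{5}$$

and the primal-to-dual mapping, given entrywise by

$$[\hat{y}(x)]_i \stackrel{\mathrm{def}}{=} \left. \frac{\partial \phi_i(z)}{\partial z} \right|_{z = a_i^\top x} \tag{6}$$

for $i = 1, \ldots, n$. (See Appendix B for a derivation of these facts and further properties of the dual.)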
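The primal and dual objects above can be checked numerically on a small least-squares instance. The sketch below is an illustration under stated assumptions: it uses $\phi_i(z) = \frac{1}{2}(z - b_i)^2$, whose Fenchel dual is $\phi_i^*(y) = \frac{1}{2}y^2 + b_i y$, takes the dual-to-primal mapping to be $s - A^\top y/\lambda$ (the first-order optimality condition associated with (4)), and takes the primal-to-dual mapping to be $\phi_i'(a_i^\top x) = a_i^\top x - b_i$. The data and function names are hypothetical.

```python
def matvec(A, x):
    return [sum(a_ij * x_j for a_ij, x_j in zip(a_i, x)) for a_i in A]

def rmatvec(A, y):
    """Compute A^T y."""
    d = len(A[0])
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(d)]

def primal_value(A, b, s, lam, x):
    """f_{s,lam}(x) = (1/2)||Ax - b||^2 + (lam/2)||x - s||^2."""
    residual = [r_i - b_i for r_i, b_i in zip(matvec(A, x), b)]
    F = 0.5 * sum(r * r for r in residual)
    return F + 0.5 * lam * sum((x_j - s_j) ** 2 for x_j, s_j in zip(x, s))

def negative_dual_value(A, b, s, lam, y):
    """g_{s,lam}(y) = G(y) + ||A^T y||^2 / (2 lam) - s^T A^T y for squared losses."""
    Aty = rmatvec(A, y)
    G = sum(0.5 * y_i * y_i + b_i * y_i for y_i, b_i in zip(y, b))
    return G + sum(v * v for v in Aty) / (2.0 * lam) - sum(s_j * v for s_j, v in zip(s, Aty))

def dual_to_primal(A, s, lam, y):
    return [s_j - v / lam for s_j, v in zip(s, rmatvec(A, y))]

def primal_to_dual(A, b, x):
    return [r_i - b_i for r_i, b_i in zip(matvec(A, x), b)]

# Hypothetical instance with n = 2, d = 1.
A = [[1.0], [2.0]]
b = [1.0, 0.0]
s = [0.0]
lam = 1.0

# Closed-form minimizer of the primal for d = 1: (sum a_i b_i + lam s) / (sum a_i^2 + lam).
x_opt = [(1.0 * 1.0 + 2.0 * 0.0 + lam * s[0]) / (1.0 + 4.0 + lam)]
y_opt = primal_to_dual(A, b, x_opt)
x_back = dual_to_primal(A, s, lam, y_opt)
gap_at_opt = primal_value(A, b, s, lam, x_opt) + negative_dual_value(A, b, s, lam, y_opt)
gap_elsewhere = primal_value(A, b, s, lam, [0.0]) + negative_dual_value(A, b, s, lam, [0.0, 0.0])
```

At the optimum the two mappings are inverse to each other and the duality gap $f_{s,\lambda}(x) + g_{s,\lambda}(y)$ vanishes, while at any other primal-dual pair the gap is positive.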


1.2 Running times and related work 

In Table 1 we compare our results with the running time of both classical and recent algorithms 
for solving the ERM problem (1) and linear least-squares regression. Here we briefly explain these 
running times and related work. 


Empirical risk minimization In the context of the ERM problem, GD refers to canonical gradient
descent on $F$, Accel. GD is Nesterov’s accelerated gradient descent (Nesterov, 1983, 2004),
SVRG is the stochastic variance-reduced gradient of Johnson and Zhang (2013), SAG is the stochastic
average gradient of Roux et al. (2012) and Defazio et al. (2014), SDCA is the stochastic dual
coordinate ascent of Shalev-Shwartz and Zhang (2013), AP-SDCA is the Accelerated Proximal
SDCA of Shalev-Shwartz and Zhang (2014) and APCG is the accelerated coordinate algorithm
of Lin et al. (2014). The latter three algorithms are more restrictive in that they only solve the
explicitly regularized problem $F + \lambda r$, even if $F$ is itself strongly convex (such algorithms run in
time inversely proportional to $\lambda$).

The running times of the algorithms are presented based on the setting considered in this paper,
i.e. under Assumptions 1.2 and 1.3. Many of the algorithms can be applied in more general settings
(e.g. even if the function $F$ is not strongly convex) and have different convergence guarantees in
those cases. The running times are characterized by four parameters: $d$ is the data dimension,
$n$ is the number of samples, $\kappa = \lceil LR^2/\mu \rceil$ is the condition number (for $F + \lambda r$ minimizers, the
condition number $\kappa_\lambda = \lceil LR^2/\lambda \rceil$ is used) and $\epsilon_0/\epsilon$ is the ratio between the initial and desired
accuracy. Running times are stated per $\tilde{O}$-notation; factors that depend polylogarithmically on $n$,
$\kappa$, and $\kappa_\lambda$ are ignored.

Linear least-squares regression For the linear least-squares regression problem, there is greater
variety in the algorithms that apply. For comparison, Table 1 includes Moore-Penrose pseudoinversion
- computed via naive matrix multiplication and inversion routines, as well as by their
asymptotically fastest counterparts - in order to compute a closed-form solution via the standard
normal equations. The table also lists algorithms based on the randomized Kaczmarz method
(Strohmer and Vershynin, 2009; Needell et al., 2014) and their accelerated variant (Lee and Sidford,
2013), as well as algorithms based on subspace embedding (OSNAP) or row sampling (Nelson
and Nguyen, 2013; Li et al., 2013; Cohen et al., 2015). Some Kaczmarz-based methods only apply
to the more restrictive problem of solving a consistent system (finding $x$ satisfying $Ax = b$) rather
than minimizing the squared loss $\|Ax - b\|_2^2$. The running times depend on the same four parameters
$n$, $d$, $\kappa$, $\epsilon_0/\epsilon$ as before, except for computing the closed-form pseudoinverse, which for simplicity we
consider “exact,” independent of initial and target errors $\epsilon_0/\epsilon$.

Approximate proximal point The key to our improved running times is a suite of approximate 
proximal point algorithms that we propose and analyze. We remark that notions of error-tolerance 
in the typical proximal point algorithm - for both its plain and accelerated variants - have been 
defined and studied in prior work (Rockafellar, 1976; Guler, 1992). However, these mainly consider 
the cumulative absolute error of iterates produced by inner minimizers, assuming that such a 
sequence is somehow produced. Since essentially any procedure of interest begins at some initial 
point - and has runtime that depends on the relative error ratio between its start and end - such
a view does not yield fully concrete algorithms, nor does it yield end-to-end runtime upper bounds 
such as those presented in this paper. 

Additional related work There is an immense body of literature on proximal point methods 
and alternating direction method of multipliers (ADMM) that are relevant to the approach in this 
paper; see Boyd et al. (2011); Parikh and Boyd (2014) for modern surveys. We also note that the 
independent work of Lin et al. (2015) contains results similar to some of those in this paper. 

1.3 Main results 

All formal results in this paper are obtained through a framework that we develop for iteratively 
applying and accelerating various minimization algorithms. When instantiated with recently- 
developed fast minimizers we obtain, under Assumptions 1.2 and 1.3, algorithms guaranteed to 
solve the ERM problem in time $\tilde{O}(nd\sqrt{\kappa}\log(1/\epsilon))$.

Our framework stems from a critical insight of the classical proximal point algorithm (PPA) or 
proximal iteration: to minimize F (or more generally, any convex function) it suffices to iteratively 
minimize 

$$f_{s,\lambda}(x) \stackrel{\mathrm{def}}{=} F(x) + \frac{\lambda}{2}\|x - s\|_2^2$$

for $\lambda > 0$ and proper choice of center $s \in \mathbb{R}^d$. PPA iteratively applies the update

$$x^{(t+1)} \leftarrow \arg\min_x f_{x^{(t)},\lambda}(x)$$

and converges to the minimizer of F. The minimization in the update is known as the proximal 
operator (Parikh and Boyd, 2014), and we refer to it in the sequel as the inner minimization 
problem. 
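The exact proximal iteration is easy to trace on a function whose proximal operator has a closed form. The sketch below is a toy illustration, not the paper's experimental setup: it assumes the diagonal quadratic $F(x) = \frac{1}{2}\sum_j h_j x_j^2$ (so $\mu = \min_j h_j$ and $F^{opt} = 0$), for which the inner minimizer is $\lambda s_j/(h_j + \lambda)$ coordinate-wise. All names and constants are hypothetical.

```python
def F(h, x):
    """F(x) = (1/2) * sum_j h_j * x_j^2, minimized at x = 0 with F^opt = 0."""
    return 0.5 * sum(h_j * x_j * x_j for h_j, x_j in zip(h, x))

def proximal_step(h, lam, s):
    """Exact minimizer of F(x) + (lam/2) * ||x - s||^2 for the diagonal quadratic F."""
    return [lam * s_j / (h_j + lam) for h_j, s_j in zip(h, s)]

h = [1.0, 10.0]  # curvatures; mu = min(h) = 1
lam = 5.0
mu = min(h)

x = [1.0, 1.0]
errors = [F(h, x)]  # F(x^(t)) - F^opt along the iteration
for _ in range(20):
    x = proximal_step(h, lam, x)
    errors.append(F(h, x))

contraction = lam / (lam + mu)  # per-iteration factor lambda / (lambda + mu)
```

Every recorded error is at most `contraction` times its predecessor, which illustrates the trade-off that drives the rest of the paper: a larger $\lambda$ makes each inner problem better conditioned but slows the outer iteration.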

In this paper we provide three distinct approximate proximal point algorithms, i.e. algorithms 
that do not require full inner minimization. Each enables the use of an existing fast algorithm as its
inner minimizer, in turn yielding several ways to obtain our improved ERM running time: 

• In Section 2 we develop a basic approximate proximal point algorithm (APPA). The algorithm
is essentially PPA with a relaxed requirement of inner minimization by only a fixed
multiplicative constant in each iteration. Instantiating this algorithm with an accelerated,
regularized ERM solver - such as APCG (Lin et al., 2014) - as its inner minimizer yields the
improved accelerated running time for the ERM problem (1).

• In Section 3 we develop Accelerated APPA. Instantiating this algorithm with SVRG (Johnson
and Zhang, 2013) as its inner minimizer yields the improved accelerated running time for both
the ERM problem (1) as well as the general ERM problem (2).

• In Section 4 we develop Dual APPA: an algorithm whose approximate inner minimizers
operate on the dual of $f_{s,\lambda}$, with warm starts between iterations. Dual APPA enables several
inner minimizers that are a priori incompatible with APPA. Instantiating this algorithm
with an accelerated, regularized ERM solver - such as APCG (Lin et al., 2014) - as its inner
minimizer yields the improved accelerated running time for the ERM problem (1).

Each of the three algorithms exhibits a slight advantage over the others in different regimes. 
APPA has by far the simplest and most straightforward analysis, and applies directly to any $\mu$-strongly
convex function $F$ (not only $F$ given by (1)). Accelerated APPA is more complicated, but
in many regimes is a more efficient reduction than APPA; it too applies to any $\mu$-strongly convex
function $F$ and in turn proves Theorem 1.1.

Our third algorithm, Dual APPA, is the least general in terms of the assumptions on which it
relies. It is the only reduction we develop that requires the ERM structure of F. However, this 
algorithm is a natural choice in conjunction with inner minimizers that operate on a popular dual 
objective. In Section 5 we demonstrate moreover that this algorithm has properties that make it 
desirable in practice. 

1.4 Paper organization 

The remainder of this paper is organized as follows. In Section 2, Section 3, and Section 4 we state 
and analyze the approximate proximal point algorithms described above. In Section 5 we discuss 
practical concerns and cover numerical experiments involving Dual APPA and related stochastic 
algorithms. In Appendix A we prove general technical lemmas used throughout the paper and in 
Appendix B we provide a derivation of regularized ERM duality and related technical lemmas. 
2 Approximate proximal point algorithm (APPA)

In this section we describe our approximate proximal point algorithm (APPA). This algorithm
is perhaps the simplest, both in its description and in its analysis, in comparison to the others
described in this paper. This section also introduces technical machinery that is used throughout
the sequel.

We first present our formal abstraction of inner minimizers (Section 2.1), then we present our
algorithm (Section 2.2), and finally we step through its analysis (Section 2.3).

2.1 Approximate primal oracles 

To design APPA, we first quantify the error that can be tolerated of an inner minimizer, while 
accounting for the computational cost of ensuring such error. The abstraction we use is the following 
notion of inner approximation: 

Definition 2.1. An algorithm $\mathcal{P}$ is a primal $(c, \lambda)$-oracle if, given $x \in \mathbb{R}^d$, it outputs $\mathcal{P}(x)$ that is
a $([f_{x,\lambda}(x) - f^{opt}_{x,\lambda}]/c)$-approximate minimizer of $f_{x,\lambda}$ in time $T_{\mathcal{P}}$.¹

In other words, a primal oracle is an algorithm initialized at $x$ that reduces the error of $f_{x,\lambda}$ by
a $1/c$ fraction, in time that depends on $\lambda$, $c$, and regularity properties of $F$. Typical iterative
first-order algorithms, such as those in Table 1, yield primal $(c, \lambda)$-oracles with runtimes $T_{\mathcal{P}}$ that
scale inversely in $\lambda$ or $\sqrt{\lambda}$, and logarithmically in $c$. For instance:

Theorem 2.2 (SVRG as a primal oracle). SVRG (Johnson and Zhang, 2013) is a primal $(c, \lambda)$-oracle
with runtime complexity $T_{\mathcal{P}} = O(nd \min\{\kappa, \kappa_\lambda\} \log c)$ for both the ERM problem (1) and the
general ERM problem (2).

Theorem 2.3 (APCG as an accelerated primal oracle). Using APCG (Lin et al., 2014) we can obtain
a primal $(c, \lambda)$-oracle with runtime complexity $T_{\mathcal{P}} = \tilde{O}(nd\sqrt{\kappa_\lambda}\log c)$ for the ERM problem (1).²

Proof. Corollary B.3 implies that, given a primal point $x$, we can obtain, in $O(nd)$ time, a corresponding
dual point $y$ such that the duality gap $f_{x,\lambda}(x) + g_{x,\lambda}(y)$ (and thus the dual error) is at
most $O(\mathrm{poly}(\kappa_\lambda))$ times the primal error. Lemma B.1 implies that decreasing the dual error by
a factor $O(\mathrm{poly}(\kappa_\lambda) c)$ decreases the induced primal error by $c$. Therefore, applying APCG to the
dual and performing the primal and dual mappings yield the theorem. □

2.2 Algorithm 

Our Approximate Proximal Point Algorithm (APPA) is given by the following Algorithm 1.

¹When the oracle is a randomized algorithm, we require that expected error is the same, i.e. that the solution be
$\epsilon$-approximate in expectation.

²AP-SDCA could likely also serve as a primal oracle with the same guarantees. However, the results in Shalev-Shwartz
and Zhang (2014) are stated assuming initial primal and dual variables are zero. It is not directly clear how
one can provide a generic relative decrease in error from this specific initial primal-dual pair.



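Algorithm 1 Approximate PPA (APPA)

input $x^{(0)} \in \mathbb{R}^d$, $\lambda > 0$
input primal ( , $\lambda$)-oracle $\mathcal{P}$
for $t = 1, \ldots, T$ do
    $x^{(t)} \leftarrow \mathcal{P}(x^{(t-1)})$
end for
output $x^{(T)}$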


The central goal of this section is to prove the following lemma, which guarantees a geometric
convergence rate for the iterates produced in this manner.

Lemma 2.4 (Contraction in APPA). For any $c' \in (0, 1)$, $x \in \mathbb{R}^d$, and possibly randomized primal
$\left(\frac{\mu + \lambda}{c'\mu}, \lambda\right)$-oracle $\mathcal{P}$ we have

$$\mathbb{E}[F(\mathcal{P}(x))] - F^{opt} \le \frac{\lambda + c'\mu}{\lambda + \mu}\left(F(x) - F^{opt}\right). \tag{7}$$

This lemma immediately implies the following running-time bounds for APPA. 

Theorem 2.5 (Un-regularizing in APPA). Given a primal ( , $\lambda$)-oracle $\mathcal{P}$, Algorithm 1 minimizes
the general ERM problem (2) to within accuracy $\epsilon$ in time $O(T_{\mathcal{P}} \lceil \lambda/\mu \rceil \log(\epsilon_0/\epsilon))$.³

Combining Theorem 2.5 and Theorem 2.3 immediately yields our desired running time for
solving (1).

Corollary 2.6. Instantiating Algorithm 1 with the oracle of Theorem 2.3 as the primal oracle and taking
$\lambda = \mu$ yields the running time of $\tilde{O}(nd\sqrt{\kappa}\log(\epsilon_0/\epsilon))$ for solving (1).
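The outer loop of APPA can be sketched in a few lines once a primal oracle is available. The code below is a toy illustration under several assumptions that are not taken from the paper: $F$ is the diagonal quadratic $\frac{1}{2}\sum_j h_j x_j^2$, the oracle is plain gradient descent on $f_{x,\lambda}$ (whose error shrinks by at least $1 - \frac{\mu+\lambda}{L_F+\lambda}$ per step with step size $1/(L_F + \lambda)$, where $L_F = \max_j h_j$), and the oracle parameter is fixed to $c = 2(\lambda+\mu)/\mu$. By Lemma 2.7, any $(c, \lambda)$-oracle then contracts the error in $F$ by at most $1/c + \lambda/(\lambda+\mu)$ per outer iteration.

```python
import math

def F(h, x):
    """F(x) = (1/2) * sum_j h_j * x_j^2, with F^opt = 0."""
    return 0.5 * sum(h_j * x_j * x_j for h_j, x_j in zip(h, x))

def gd_primal_oracle(h, lam, c, center):
    """Gradient descent on f(x) = F(x) + (lam/2)||x - center||^2, started at center.

    The step count is chosen so that the standard linear rate of gradient descent
    guarantees a reduction of the error of f by a factor of at least 1/c.
    """
    mu, L = min(h), max(h)
    steps = math.ceil((L + lam) / (mu + lam) * math.log(c))
    step_size = 1.0 / (L + lam)
    x = list(center)
    for _ in range(steps):
        x = [x_j - step_size * (h_j * x_j + lam * (x_j - s_j))
             for h_j, x_j, s_j in zip(h, x, center)]
    return x

def appa(h, lam, x0, T):
    """Outer loop: x^(t) <- P(x^(t-1)) with a (c, lam)-oracle P."""
    mu = min(h)
    c = 2.0 * (lam + mu) / mu  # assumed oracle parameter, chosen for illustration
    iterates = [list(x0)]
    for _ in range(T):
        iterates.append(gd_primal_oracle(h, lam, c, iterates[-1]))
    return iterates

h = [1.0, 100.0]  # hypothetical curvatures: mu = 1, L_F = 100
lam = 1.0         # lambda = mu, as in Corollary 2.6
mu = min(h)
iterates = appa(h, lam, [1.0, 1.0], T=30)
errors = [F(h, x) for x in iterates]
rate = 1.0 / (2.0 * (lam + mu) / mu) + lam / (lam + mu)  # 1/c + lam/(lam + mu)
```

With these constants `rate` equals 0.75, and the recorded errors decrease at least that fast even though no inner problem is solved exactly.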


2.3 Analysis 


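This section gives a proof of Lemma 2.4. Throughout, no assumption is made on $F$ aside from
$\mu$-strong convexity. Namely, we need not have $F$ be smooth or at all differentiable.

First, we consider the effect of an exact inner minimizer. Namely, we prove the following lemma
relating the minimum of the inner problem $f_{s,\lambda}$ to $F^{opt}$.

Lemma 2.7 (Relationship between minima). For all $s \in \mathbb{R}^d$ and $\lambda > 0$

$$f^{opt}_{s,\lambda} - F^{opt} \le \frac{\lambda}{\mu + \lambda}\left(F(s) - F^{opt}\right).$$

Proof. Let $x^{opt} = \arg\min_x F(x)$ and for all $\alpha \in [0, 1]$ let $x_\alpha = (1 - \alpha)s + \alpha x^{opt}$. The $\mu$-strong
convexity of $F$ implies that, for all $\alpha \in [0, 1]$,

$$F(x_\alpha) \le (1 - \alpha)F(s) + \alpha F(x^{opt}) - \frac{\alpha(1 - \alpha)\mu}{2}\|s - x^{opt}\|_2^2.$$

Consequently, by the definition of $f^{opt}_{s,\lambda}$,

$$f^{opt}_{s,\lambda} \le F(x_\alpha) + \frac{\lambda}{2}\|x_\alpha - s\|_2^2 \le (1 - \alpha)F(s) + \alpha F(x^{opt}) - \frac{\alpha(1 - \alpha)\mu}{2}\|s - x^{opt}\|_2^2 + \frac{\lambda\alpha^2}{2}\|s - x^{opt}\|_2^2.$$

Choosing $\alpha = \frac{\mu}{\mu + \lambda}$ makes the last two terms cancel, since then $\lambda\alpha = (1 - \alpha)\mu$, and yields the result. □

³When the oracle is a randomized algorithm, the expected accuracy is at most $\epsilon$.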
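This immediately implies contraction for the exact PPA, as it implies that in every iteration of
PPA the error in $F$ decreases by a multiplicative $\lambda/(\lambda + \mu)$. Using this we prove Lemma 2.4.

Proof of Lemma 2.4. Let $x' = \mathcal{P}(x)$ and let $c$ denote the first parameter of the primal oracle $\mathcal{P}$.
By definition of a primal $(c, \lambda)$-oracle we have (in expectation, when $\mathcal{P}$ is randomized)

$$f_{x,\lambda}(x') - f^{opt}_{x,\lambda} \le \frac{1}{c}\left(f_{x,\lambda}(x) - f^{opt}_{x,\lambda}\right).$$

Combining this and Lemma 2.7 we have

$$f_{x,\lambda}(x') - F^{opt} \le \frac{1}{c}\left(f_{x,\lambda}(x) - f^{opt}_{x,\lambda}\right) + \frac{\lambda}{\mu + \lambda}\left(F(x) - F^{opt}\right).$$

Using that clearly for all $z$ we have $F(z) \le f_{x,\lambda}(z)$ we see that $F(x') \le f_{x,\lambda}(x')$ and $F^{opt} \le f^{opt}_{x,\lambda}$.
Combining with the fact that $f_{x,\lambda}(x) = F(x)$, and substituting the value of $c$, yields the result. □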


3 Accelerated APPA 

In this section we show how to generically accelerate the APPA algorithm of Section 2. Accelerated
APPA (Algorithm 2) uses inner minimizers more efficiently, but requires a smaller minimization 
factor when compared to APPA. The algorithm and its analysis immediately prove Theorem 1.1 
and in turn yield another means by which we achieve the accelerated running time guarantees for 
solving (1). 

We first present the algorithm and state its running time guarantees (Section 3.1), then prove 
the guarantees as part of analysis (Section 3.2). 

3.1 Algorithm 

Our accelerated APPA algorithm is given by Algorithm 2. In every iteration it still makes a single
call to a primal oracle, but rather than requiring a fixed constant minimization factor, the minimization
factor depends polynomially on the ratio of $\lambda$ and $\mu$.


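Algorithm 2 Accelerated APPA

input $x^{(0)} \in \mathbb{R}^d$, $\mu > 0$, $\lambda > 2\mu$
input primal $(4\rho^{3/2}, \lambda)$-oracle $\mathcal{P}$, where $\rho = \frac{\mu + 2\lambda}{\mu}$
Define $\zeta = \frac{2}{\mu} + \frac{1}{\lambda}$
$v^{(0)} \leftarrow x^{(0)}$
for $t = 0, \ldots, T - 1$ do
    $y^{(t)} \leftarrow \left(\frac{1}{1 + \rho^{-1/2}}\right) x^{(t)} + \left(\frac{\rho^{-1/2}}{1 + \rho^{-1/2}}\right) v^{(t)}$
    $x^{(t+1)} \leftarrow \mathcal{P}(y^{(t)})$
    $g^{(t)} \leftarrow \lambda(y^{(t)} - x^{(t+1)})$
    $v^{(t+1)} \leftarrow (1 - \rho^{-1/2})\, v^{(t)} + \rho^{-1/2}\left[y^{(t)} - \zeta g^{(t)}\right]$
end for
output $x^{(T)}$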


The central goal is to prove the following theorem regarding the running time of APPA. 
Theorem 3.1 (Un-regularizing in Accelerated APPA). Given a primal $\left(4\left(\frac{2\lambda + \mu}{\mu}\right)^{3/2}, \lambda\right)$-oracle $\mathcal{P}$
for $\lambda > 2\mu$, Algorithm 2 minimizes the general ERM problem (2) to within accuracy $\epsilon$ in time
$O(T_{\mathcal{P}} \sqrt{\lceil \lambda/\mu \rceil} \log(\epsilon_0/\epsilon))$.

This theorem is essentially a restatement of Theorem 1.1 and by instantiating it with Theorem 2.2
we obtain the following.

Corollary 3.2. Instantiating Theorem 3.1 with SVRG (Johnson and Zhang, 2013) as the primal
oracle and taking $\lambda = 2\mu + LR^2$ yields the running time bound $\tilde{O}(nd\sqrt{\kappa}\log(\epsilon_0/\epsilon))$ for the general
ERM problem (2).
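The accelerated outer loop can be sketched in the same toy setting used to illustrate proximal iterations. Everything below is an assumption made for illustration rather than the paper's implementation: $F$ is the diagonal quadratic $\frac{1}{2}\sum_j h_j x_j^2$, the primal oracle is plain gradient descent on $f_{y,\lambda}$ started at the center $y$, the auxiliary sequence is initialized as $v^{(0)} = x^{(0)}$, and the constants are $\rho = (\mu + 2\lambda)/\mu$, $\zeta = 2/\mu + 1/\lambda$, with oracle parameter $4\rho^{3/2}$.

```python
import math

def F(h, x):
    """F(x) = (1/2) * sum_j h_j * x_j^2, with F^opt = 0."""
    return 0.5 * sum(h_j * x_j * x_j for h_j, x_j in zip(h, x))

def gd_primal_oracle(h, lam, c, center):
    """Gradient descent on F(x) + (lam/2)||x - center||^2, run long enough for a 1/c reduction."""
    mu, L = min(h), max(h)
    steps = math.ceil((L + lam) / (mu + lam) * math.log(c))
    step_size = 1.0 / (L + lam)
    x = list(center)
    for _ in range(steps):
        x = [x_j - step_size * (h_j * x_j + lam * (x_j - s_j))
             for h_j, x_j, s_j in zip(h, x, center)]
    return x

def accelerated_appa(h, lam, x0, T):
    mu = min(h)
    rho = (mu + 2.0 * lam) / mu
    theta = rho ** -0.5            # rho^(-1/2)
    c = 4.0 * rho ** 1.5           # oracle parameter 4 * rho^(3/2)
    zeta = 2.0 / mu + 1.0 / lam
    x = list(x0)
    v = list(x0)                   # assumed initialization v^(0) = x^(0)
    for _ in range(T):
        y = [(x_j + theta * v_j) / (1.0 + theta) for x_j, v_j in zip(x, v)]
        x = gd_primal_oracle(h, lam, c, y)
        g = [lam * (y_j - x_j) for y_j, x_j in zip(y, x)]
        v = [(1.0 - theta) * v_j + theta * (y_j - zeta * g_j)
             for v_j, y_j, g_j in zip(v, y, g)]
    return x

h = [1.0, 100.0]  # hypothetical curvatures: mu = 1
lam = 3.0         # satisfies lam > 2 * mu
x0 = [1.0, 1.0]
x_final = accelerated_appa(h, lam, x0, T=40)
final_ratio = F(h, x_final) / F(h, x0)
```

The only change relative to the unaccelerated loop is the extrapolated center $y^{(t)}$ and the bookkeeping of $v^{(t)}$; the oracle is still called once per iteration, only with a more demanding accuracy parameter.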


3.2 Analysis 

Here we establish the convergence rate of Algorithm 2, Accelerated APPA, and prove Theorem 3.1. 
Note that as in Section 2 the results in this section use nothing about the structure of F other than 
strong convexity and thus they apply to the general ERM problem (2). 

We remark that aspects of the proofs in this section bear resemblance to the analysis in Shalev- 
Shwartz and Zhang (2014), which achieves similar results in a more specialized setting. 

Our proof is split into the following parts. 

• In Lemma 3.3 we show that applying a primal oracle to the inner minimization problem gives 
us a quadratic lower bound on F{x). 

• In Lemma 3.4 we use this lower bound to construct a series of lower bounds for the main
objective function $F$, and accelerate the APPA algorithm, comprising the bulk of the analysis.

• In Lemma 3.5 we show that the requirements of Lemma 3.4 can be met by using a primal 
oracle that decreases the error by a constant factor. 

• In Lemma 3.6 we analyze the initial error requirements of Lemma 3.4. 


The proof of Theorem 3.1 follows immediately from these lemmas. 


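Lemma 3.3. For $x_0 \in \mathbb{R}^n$ and $\epsilon > 0$ suppose that $x^+$ is an $\epsilon$-approximate solution to $f_{x_0,\lambda}$. Then
for $\mu' \stackrel{\mathrm{def}}{=} \mu/2$, $g \stackrel{\mathrm{def}}{=} \lambda(x_0 - x^+)$, and all $x \in \mathbb{R}^n$ we have

$$F(x) \ge F(x^+) - \frac{1}{2\mu'}\|g\|_2^2 + \frac{\mu'}{2}\left\| x - \left( x_0 - \left(\frac{1}{\mu'} + \frac{1}{\lambda}\right) g \right) \right\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\,\epsilon.$$

Note that as $\mu' = \mu/2$ we are only losing a factor of 2 in the strong convexity parameter for our
lower bound. This allows us to account for errors without sacrificing in our ultimate asymptotic
convergence rates.

Proof. Since $F$ is $\mu$-strongly convex, clearly $f_{x_0,\lambda}$ is $(\mu + \lambda)$-strongly convex, and by Lemma A.1

$$f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^{opt}_{x_0,\lambda}) \ge \frac{\mu + \lambda}{2}\|x - x^{opt}_{x_0,\lambda}\|_2^2. \tag{8}$$

By Cauchy-Schwarz and Young’s Inequality we know that

$$\frac{\lambda + \mu'}{2}\|x - x^+\|_2^2 \le \frac{\lambda + \mu}{2}\|x - x^{opt}_{x_0,\lambda}\|_2^2 + \frac{(\lambda + \mu')(\lambda + \mu)}{2\mu'}\|x^{opt}_{x_0,\lambda} - x^+\|_2^2,$$

which implies

$$\frac{\mu + \lambda}{2}\|x - x^{opt}_{x_0,\lambda}\|_2^2 \ge \frac{\lambda + \mu'}{2}\|x - x^+\|_2^2 - \frac{(\lambda + \mu')(\lambda + \mu)}{2\mu'}\|x^{opt}_{x_0,\lambda} - x^+\|_2^2.$$

On the other hand, since $f_{x_0,\lambda}(x^+) \le f_{x_0,\lambda}(x^{opt}_{x_0,\lambda}) + \epsilon$ by assumption we have
$\frac{\mu + \lambda}{2}\|x^+ - x^{opt}_{x_0,\lambda}\|_2^2 \le \epsilon$ and therefore

$$\begin{aligned}
f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^+) &\ge f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^{opt}_{x_0,\lambda}) - \epsilon \\
&\ge \frac{\mu + \lambda}{2}\|x - x^{opt}_{x_0,\lambda}\|_2^2 - \epsilon \\
&\ge \frac{\lambda + \mu'}{2}\|x - x^+\|_2^2 - \frac{(\lambda + \mu')(\lambda + \mu)}{2\mu'}\|x^{opt}_{x_0,\lambda} - x^+\|_2^2 - \epsilon \\
&\ge \frac{\lambda + \mu'}{2}\|x - x^+\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\,\epsilon.
\end{aligned}$$

Now since

$$\|x - x^+\|_2^2 = \left\|x - x_0 + \frac{1}{\lambda}g\right\|_2^2 = \|x - x_0\|_2^2 + \frac{2}{\lambda}\langle g, x - x_0 \rangle + \frac{1}{\lambda^2}\|g\|_2^2,$$

and using the fact that $f_{x_0,\lambda}(x) = F(x) + \frac{\lambda}{2}\|x - x_0\|_2^2$, we have

$$F(x) \ge F(x^+) + \left[\frac{1}{\lambda} + \frac{\mu'}{2\lambda^2}\right]\|g\|_2^2 + \left(1 + \frac{\mu'}{\lambda}\right)\langle g, x - x_0 \rangle + \frac{\mu'}{2}\|x - x_0\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\,\epsilon.$$

The right hand side of the above equation is a quadratic function. Looking at its gradient with
respect to $x$ we see that it obtains its minimum when $x = x_0 - \left(\frac{1}{\mu'} + \frac{1}{\lambda}\right) g$ and has a minimum value
of $F(x^+) - \frac{1}{2\mu'}\|g\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\,\epsilon$. □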
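The quadratic lower bound of Lemma 3.3 can be sanity-checked numerically in one dimension. The sketch below is an illustration only: it assumes $F(x) = \frac{h}{2}x^2$ (so $\mu = h$), uses the bound in the form $F(x) \ge F(x^+) - \frac{1}{2\mu'}g^2 + \frac{\mu'}{2}\left(x - x_0 + (\frac{1}{\mu'} + \frac{1}{\lambda})g\right)^2 - \frac{\lambda + 2\mu'}{\mu'}\epsilon$ with $\mu' = \mu/2$ and $g = \lambda(x_0 - x^+)$, and picks a deliberately inexact $x^+$. The function name and constants are hypothetical.

```python
def lemma_3_3_gap(h, lam, x0, x_plus, x):
    """F(x) minus the quadratic lower bound, for F(x) = (h/2) * x^2 in one dimension."""
    mu_prime = h / 2.0                                   # mu' = mu / 2, with mu = h here

    def f(u):
        return 0.5 * h * u * u + 0.5 * lam * (u - x0) ** 2

    x_opt = lam * x0 / (h + lam)                         # exact minimizer of f
    eps = f(x_plus) - f(x_opt)                           # x_plus is eps-approximate
    g = lam * (x0 - x_plus)
    center = x0 - (1.0 / mu_prime + 1.0 / lam) * g       # minimizer of the lower bound
    bound = (0.5 * h * x_plus ** 2
             - g * g / (2.0 * mu_prime)
             + 0.5 * mu_prime * (x - center) ** 2
             - (lam + 2.0 * mu_prime) / mu_prime * eps)
    return 0.5 * h * x * x - bound

# Evaluate the gap on a grid of test points for a hypothetical instance.
grid = [k / 10.0 for k in range(-30, 31)]
gaps = [lemma_3_3_gap(2.0, 3.0, 1.0, 0.7, x) for x in grid]
smallest_gap = min(gaps)
```

For these constants the gap works out to $\frac{1}{2}(x - 0.2)^2$, so the bound holds everywhere on the grid and is tight at $x = 0.2$.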


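Lemma 3.4. Suppose that in each iteration $t$ we have $\psi_t(x) \stackrel{\mathrm{def}}{=} \psi_t^{opt} + \frac{\mu'}{2}\|x - v^{(t)}\|_2^2$ such that $F(x) \ge \psi_t(x)$
for all $x$. Let $\rho \stackrel{\mathrm{def}}{=} \frac{\mu' + \lambda}{\mu'}$ for $\lambda \ge 3\mu'$, and let

• $y^{(t)} \stackrel{\mathrm{def}}{=} \frac{1}{1 + \rho^{-1/2}}\, x^{(t)} + \frac{\rho^{-1/2}}{1 + \rho^{-1/2}}\, v^{(t)}$,

• $\mathbb{E}[f_{y^{(t)},\lambda}(x^{(t+1)})] - f^{opt}_{y^{(t)},\lambda} \le \frac{\rho^{-3/2}}{4}\left(F(x^{(t)}) - \psi_t^{opt}\right)$,

• $g^{(t)} \stackrel{\mathrm{def}}{=} \lambda(y^{(t)} - x^{(t+1)})$,

• $v^{(t+1)} \stackrel{\mathrm{def}}{=} (1 - \rho^{-1/2})\, v^{(t)} + \rho^{-1/2}\left[ y^{(t)} - \left(\frac{1}{\mu'} + \frac{1}{\lambda}\right) g^{(t)} \right]$.

We have

$$\mathbb{E}[F(x^{(t)}) - \psi_t^{opt}] \le \left(1 - \frac{\rho^{-1/2}}{2}\right)^t \left(F(x_0) - \psi_0^{opt}\right).$$

Proof. Regardless of how $y^{(t)}$ is chosen we know by Lemma 3.3 that for $\gamma = 1 + \frac{\mu'}{\lambda}$ and all $x \in \mathbb{R}^n$

$$F(x) \ge F(x^{(t+1)}) - \frac{1}{2\mu'}\|g^{(t)}\|_2^2 + \frac{\mu'}{2}\left\| x - \left( y^{(t)} - \frac{\gamma}{\mu'} g^{(t)} \right) \right\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right). \tag{9}$$

Thus, for $\beta = 1 - \rho^{-1/2}$ we can let

$$\begin{aligned}
\psi_{t+1}(x) &\stackrel{\mathrm{def}}{=} \beta\psi_t(x) + (1 - \beta)\left[ F(x^{(t+1)}) - \frac{1}{2\mu'}\|g^{(t)}\|_2^2 + \frac{\mu'}{2}\left\| x - \left( y^{(t)} - \frac{\gamma}{\mu'} g^{(t)} \right) \right\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right) \right] \\
&= \beta\left[ \psi_t^{opt} + \frac{\mu'}{2}\|x - v^{(t)}\|_2^2 \right] + (1 - \beta)\left[ F(x^{(t+1)}) - \frac{1}{2\mu'}\|g^{(t)}\|_2^2 + \frac{\mu'}{2}\left\| x - \left( y^{(t)} - \frac{\gamma}{\mu'} g^{(t)} \right) \right\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right) \right] \\
&= \psi_{t+1}^{opt} + \frac{\mu'}{2}\|x - v^{(t+1)}\|_2^2,
\end{aligned}$$

where in the last line we used Lemma A.3. Again, by Lemma A.3 we know that

$$\begin{aligned}
\psi_{t+1}^{opt} &= \beta\psi_t^{opt} + (1 - \beta)\left( F(x^{(t+1)}) - \frac{1}{2\mu'}\|g^{(t)}\|_2^2 - \frac{\lambda + 2\mu'}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right) \right) + \beta(1 - \beta)\frac{\mu'}{2}\left\| v^{(t)} - \left( y^{(t)} - \frac{\gamma}{\mu'} g^{(t)} \right) \right\|_2^2 \\
&\ge \beta\psi_t^{opt} + (1 - \beta)F(x^{(t+1)}) - \frac{(1 - \beta)^2}{2\mu'}\|g^{(t)}\|_2^2 + \beta(1 - \beta)\gamma\langle g^{(t)}, v^{(t)} - y^{(t)} \rangle - \frac{(1 - \beta)(\lambda + 2\mu')}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right).
\end{aligned}$$

In the second step we used the following fact:

$$-\frac{1 - \beta}{2\mu'}\|g^{(t)}\|_2^2 + \beta(1 - \beta)\frac{\gamma^2}{2\mu'}\|g^{(t)}\|_2^2 = -\frac{1 - \beta}{2\mu'}\left(1 - \beta\gamma^2\right)\|g^{(t)}\|_2^2 \ge -\frac{(1 - \beta)^2}{2\mu'}\|g^{(t)}\|_2^2.$$

Furthermore, expanding the term $\frac{\mu'}{2}\left\| x - \left( y^{(t)} - \frac{\gamma}{\mu'} g^{(t)} \right) \right\|_2^2$ and instantiating $x$ with $x^{(t)}$ in (9) yields

$$F(x^{(t+1)}) \le F(x^{(t)}) - \frac{1}{\lambda}\|g^{(t)}\|_2^2 + \gamma\langle g^{(t)}, y^{(t)} - x^{(t)} \rangle + \frac{\lambda + 2\mu'}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right).$$

Consequently we know

$$F(x^{(t+1)}) - \psi_{t+1}^{opt} \le \beta\left[F(x^{(t)}) - \psi_t^{opt}\right] + \left[ \frac{(1 - \beta)^2}{2\mu'} - \frac{\beta}{\lambda} \right]\|g^{(t)}\|_2^2 + \gamma\beta\left\langle g^{(t)}, y^{(t)} - x^{(t)} - (1 - \beta)(v^{(t)} - y^{(t)}) \right\rangle + \frac{\lambda + 2\mu'}{\mu'}\left(f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}\right).$$

Note that we have chosen $y^{(t)}$ so that the inner product term equals 0, and we choose $\beta = 1 - \rho^{-1/2} \ge \frac{1}{2}$
which ensures

$$\frac{(1 - \beta)^2}{2\mu'} - \frac{\beta}{\lambda} \le \frac{1}{2(\mu' + \lambda)} - \frac{1}{2\lambda} \le 0.$$

Also, by assumption we know $\mathbb{E}[f_{y^{(t)},\lambda}(x^{(t+1)}) - f^{opt}_{y^{(t)},\lambda}] \le \frac{\rho^{-3/2}}{4}\left(F(x^{(t)}) - \psi_t^{opt}\right)$, which implies

$$\mathbb{E}[F(x^{(t+1)}) - \psi_{t+1}^{opt}] \le \left( \beta + \frac{\lambda + 2\mu'}{\mu'} \cdot \frac{\rho^{-3/2}}{4} \right)\left(F(x^{(t)}) - \psi_t^{opt}\right) \le \left(1 - \rho^{-1/2}/2\right)\left(F(x^{(t)}) - \psi_t^{opt}\right).$$

In the final step we are using the fact that $\frac{\lambda + 2\mu'}{\mu'} \le 2\rho$ and $\rho \ge 1$. □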


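Lemma 3.5. Under the setting of Lemma 3.4, we have $f_{y^{(t)},\lambda}(x^{(t)}) - f^{opt}_{y^{(t)},\lambda} \le F(x^{(t)}) - \psi_t^{opt}$. In
particular, in order to achieve $\mathbb{E}[f_{y^{(t)},\lambda}(x^{(t+1)})] - f^{opt}_{y^{(t)},\lambda} \le \frac{\rho^{-3/2}}{4}\left(F(x^{(t)}) - \psi_t^{opt}\right)$ we only need an oracle that
shrinks the function error by a factor of $\frac{\rho^{-3/2}}{4}$ (in expectation).

Proof. We know

$$f_{y^{(t)},\lambda}(x^{(t)}) - F(x^{(t)}) = \frac{\lambda}{2}\|x^{(t)} - y^{(t)}\|_2^2 = \frac{\lambda}{2} \cdot \frac{\rho^{-1}}{(1 + \rho^{-1/2})^2}\|x^{(t)} - v^{(t)}\|_2^2.$$

We now show that the lower bound $f^{opt}_{y^{(t)},\lambda}$ is larger than $\psi_t^{opt}$ by the same amount. This is because
for all $x$ we have

$$f_{y^{(t)},\lambda}(x) = F(x) + \frac{\lambda}{2}\|x - y^{(t)}\|_2^2 \ge \psi_t^{opt} + \frac{\mu'}{2}\|x - v^{(t)}\|_2^2 + \frac{\lambda}{2}\|x - y^{(t)}\|_2^2.$$

The right hand side is a quadratic function, whose optimal point is at $x = \frac{\mu' v^{(t)} + \lambda y^{(t)}}{\mu' + \lambda}$ and whose
optimal value is equal to

$$\psi_t^{opt} + \frac{\mu'\lambda}{2(\mu' + \lambda)}\|v^{(t)} - y^{(t)}\|_2^2 = \psi_t^{opt} + \frac{\mu'\lambda}{2(\mu' + \lambda)} \cdot \frac{1}{(1 + \rho^{-1/2})^2}\|x^{(t)} - v^{(t)}\|_2^2.$$

By definition of $\rho$ we know $\frac{\mu'\lambda}{2(\mu' + \lambda)} \cdot \frac{1}{(1 + \rho^{-1/2})^2}$ is exactly equal to $\frac{\lambda}{2} \cdot \frac{\rho^{-1}}{(1 + \rho^{-1/2})^2}$,
therefore $f_{y^{(t)},\lambda}(x^{(t)}) - f^{opt}_{y^{(t)},\lambda} \le F(x^{(t)}) - \psi_t^{opt}$. □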


Remark In the next lemma we show that moving to the regularized problem has the same effect 
on the primal function value and the lower bound. This is a result of the choice of $\beta$ in the proof 
of Lemma 3.4. However, this does not mean that the choice of $\beta$ is very fragile. We can choose any 
$\beta'$ that is between the current $\beta$ and 1; the effect on this lemma will be that the increase in the primal 
function value becomes smaller than the increase in the lower bound (so the lemma continues to hold). 

Lemma 3.6. Let = F{x^^^) — — f°^^), and then -ipo + 

y\\x — uolP is a valid lower bound for F. In particular when X = LR^ then F{x^^^) — ipQ^^ < 

2k(F(x(°)) 

Proof. This lemma is a direct corollary of Lemma 3.3 with x"*" = x^*^/ □ 


4 Dual APPA 

In this section we develop Dual APPA (Algorithm 3), a natural approximate proximal point 
algorithm that operates entirely in the regularized ERM dual. Our focus here is on the theoretical 
properties of Dual APPA; Section 5 later explores its behavior in practice. 

We first present an abstraction for dual-based inner minimizers (Section 4.1), then present the 
algorithm (Section 4.2), and finally step through its runtime analysis (Section 4.3). 

4.1 Approximate dual oracles 

Our primary goal in this section is to quantify how much objective function progress an algorithm 
needs to make in the dual problem (see Section 1.1) in order to ensure primal progress at a 
rate similar to that in APPA (Algorithm 1). 

Here, similar to Section 2.1, we formally define our requirements for an approximate dual-based 
inner minimizer. In particular, we use the following notion of a dual oracle. 

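Definition 4.1. An algorithm $\mathcal{D}$ is a dual $(c, \lambda)$-oracle if, given $s \in \mathbb{R}^d$ and $y \in \mathbb{R}^n$, it outputs 
$\mathcal{D}(s, y)$ that is a $\left([g_{s,\lambda}(y) - g^{\mathrm{opt}}_{s,\lambda}]/c\right)$-approximate minimizer of $g_{s,\lambda}$ in time $T_{\mathcal{D}}$.⁴ 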

Dual-based algorithms for regularized ERM and variants of coordinate descent can typically be 
used as such a dual oracle. In particular, we note that APCG is such a dual oracle. 

Theorem 4.2 (APCG as a dual oracle). APCG (Lin et al., 2014) is a dual $(c, \lambda)$-oracle with 
runtime complexity $T_{\mathcal{D}} = O(nd\sqrt{\kappa_\lambda} \log c)$.⁵ 

4.2 Algorithm 

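Our Dual APPA is given by the following Algorithm 3. 


Algorithm 3 Dual APPA 
input $x^{(0)} \in \mathbb{R}^d$, $\lambda > 0$ 
input dual $(\sigma, \lambda)$-oracle $\mathcal{D}$ (see Theorem 4.3 for $\sigma$) 
$y^{(0)} \leftarrow \hat{y}(x^{(0)})$ 
for $t = 1, \ldots, T$ do 
    $y^{(t)} \leftarrow \mathcal{D}(x^{(t-1)}, y^{(t-1)})$ 
    $x^{(t)} \leftarrow \hat{x}_{x^{(t-1)}}(y^{(t)})$ 
end for 
output $x^{(T)}$ 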
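
As a rough illustration of the outer loop of Algorithm 3, the following Python sketch instantiates it for squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$. It is only a sketch under stated assumptions: the function names are hypothetical, and the oracle is a stand-in that solves the dual inner problem exactly by a linear solve (rather than APCG or SDCA), so each stage reduces to an exact proximal point step.

```python
import numpy as np


def primal_to_dual(A, b, x):
    # y_hat(x)_i = phi_i'(a_i^T x); for squared loss this is a_i^T x - b_i.
    return A @ x - b


def dual_to_primal(A, s, y, lam):
    # x_hat_s(y) = s - (1/lam) A^T y
    return s - A.T @ y / lam


def exact_dual_oracle(A, b, s, y, lam):
    # Hypothetical stand-in for the oracle D: returns the exact minimizer of
    # g_{s,lam}, which for squared loss solves (I + A A^T / lam) y = A s - b.
    # The warm start y is accepted for interface compatibility but unused.
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) + A @ A.T / lam, A @ s - b)


def dual_appa(A, b, x0, lam, num_stages, oracle=exact_dual_oracle):
    x = np.array(x0, dtype=float)
    y = primal_to_dual(A, b, x)            # y^(0) <- y_hat(x^(0))
    for _ in range(num_stages):
        y = oracle(A, b, x, y, lam)        # y^(t) <- D(x^(t-1), y^(t-1))
        x = dual_to_primal(A, x, y, lam)   # x^(t) <- x_hat_{x^(t-1)}(y^(t))
    return x
```

With the exact stand-in oracle, one stage returns the proximal point of $F$ at $x^{(0)}$, and repeated stages approach the least-squares solution; an inexact oracle would replace `exact_dual_oracle` without changing the loop.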


Dual APPA (Algorithm 3) repeatedly queries a dual oracle while producing primal iterates via 
the dual-to-primal mapping (5) along the way. We show that it obtains the following running time 
bound: 

Theorem 4.3 (Un-regularizing in Dual APPA). Given a dual $(\sigma, \lambda)$-oracle $\mathcal{D}$, where 

a > max{K, kx}\\/ g) 

Algorithm 3 minimizes the ERM problem (1) to within accuracy $\epsilon$ in time $O(T_{\mathcal{D}} \lceil \lambda/\mu \rceil \log(\epsilon_0/\epsilon))$.⁶ 

Combining Theorem 4.3 and Theorem 4.2 immediately yields another way to achieve our desired 
running time for solving (1). 

⁴As in the primal oracle definition, when the oracle is a randomized algorithm, we require that its output be an 
expected $\epsilon$-approximate solution. 

⁵As in Theorem 2.3, AP-SDCA could likely also serve as a dual oracle with the same guarantees, provided it is 
modified to allow for the more general primal-dual initialization. 

⁶As in Theorem 2.5, when the oracle is a randomized algorithm, the expected accuracy is at most $\epsilon$. 

Corollary 4.4. Instantiating Theorem 4.3 with Theorem 4.2 as the dual oracle and taking $\lambda = \mu$ 
yields the running time bound $\tilde{O}(nd\sqrt{\kappa} \log(\epsilon_0/\epsilon))$. 

While both this result and the results in Section 2 show that APCG can be used to achieve our 
fastest running times for solving (1), note that the algorithms they suggest are in fact different. In 
every invocation of APCG in Algorithm 1, we need to explicitly compute both the primal-to-dual 
and dual-to-primal mappings (in $O(nd)$ time). However, here we only need to compute the 
primal-to-dual mapping once upfront, in order to initialize the algorithm. Every subsequent invocation 
of APCG then only requires a single dual-to-primal mapping computation, which can often be 
streamlined. From a practical viewpoint, this can be seen as a natural “warm start” scheme for 
the dual-based inner minimizer. 


4.3 Analysis 

Here we prove Theorem 4.3. We begin by bounding the error of the dual regularized ERM problem 
when the center of regularization changes. This characterizes the initial error at the beginning of 
each Dual APPA iteration. 


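Lemma 4.5 (Dual error after re-centering). For all $y \in \mathbb{R}^n$, $x \in \mathbb{R}^d$, and $x' = \hat{x}_x(y)$ we have 

$$g_{x',\lambda}(y) - g^{\mathrm{opt}}_{x',\lambda} \le 2\left(g_{x,\lambda}(y) - g^{\mathrm{opt}}_{x,\lambda}\right) + 4n\kappa\left[F(x') - F^{\mathrm{opt}} + F(x) - F^{\mathrm{opt}}\right].$$

In other words, the dual error $g_{s,\lambda}(y) - g^{\mathrm{opt}}_{s,\lambda}$ is bounded across a re-centering step by multiples 
of previous sub-optimality measurements (namely, dual error and gradient norm). 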

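Proof. By the definition of $g_{x,\lambda}$ and $x'$ we have, for all $z$, 

$$g_{x',\lambda}(z) = G(z) + \frac{1}{2\lambda}\|A^\top z\|_2^2 - x'^\top A^\top z = g_{x,\lambda}(z) - (x' - x)^\top A^\top z = g_{x,\lambda}(z) + \frac{1}{\lambda} y^\top A A^\top z.$$

Furthermore, since $g_{x,\lambda}$ is $\frac{1}{L}$-strongly convex we can invoke Lemma A.2, obtaining 

$$g_{x',\lambda}(y) - g^{\mathrm{opt}}_{x',\lambda} \le 2\left(g_{x,\lambda}(y) - g^{\mathrm{opt}}_{x,\lambda}\right) + L\left\|\frac{1}{\lambda} A A^\top y\right\|_2^2.$$

Since each row of $A$ has $\ell_2$ norm at most $R$ we know that $\|Az\|_2^2 \le nR^2\|z\|_2^2$, and we know that by 
definition $A^\top y = \lambda(x - x')$. Combining these yields 

$$g_{x',\lambda}(y) - g^{\mathrm{opt}}_{x',\lambda} \le 2\left(g_{x,\lambda}(y) - g^{\mathrm{opt}}_{x,\lambda}\right) + nLR^2\|x - x'\|_2^2.$$

Finally, since $F$ is $\mu$-strongly convex, by Lemma A.1, we have 

$$\frac{1}{2}\|x - x'\|_2^2 \le \|x' - x^{\mathrm{opt}}\|_2^2 + \|x - x^{\mathrm{opt}}\|_2^2 \le \frac{2}{\mu}\left[F(x') - F^{\mathrm{opt}} + F(x) - F^{\mathrm{opt}}\right].$$

Combining and recalling the definition of $\kappa$ yields the result. □ 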

The following lemma establishes the rate of convergence of the primal iterates produced 
over the course of Dual APPA, and in turn implies Theorem 4.3. 

Lemma 4.6 (Convergence rate of Dual APPA). Let $c' \in (0,1)$ be arbitrary and suppose that 
$\sigma \ge (40/c')\, n^2 \kappa_\lambda^2 \max\{\kappa, \kappa_\lambda\} \lceil \lambda/\mu \rceil$ in Dual APPA (Algorithm 3). Then in every iteration $t \ge 1$ of 
Dual APPA (Algorithm 3) the following invariants hold: 


(f(x<“>) - F"'*) , and (10) 

‘ ■ ( 11 ) 

Proof. For notational convenience we let r Aj ft fxW A> D F{x^^'l) — 

popt £qj. t > 0. Thus, we wish to show that et-i < r^~^eQ (equivalent to (11)) and we wish to 
show that g't_i(y0)) — gt-\ < (equivalent to (10)) for alH > 1. 

By definition of a dual oracle we have, for all t > 1, 




rfh < - 


cr L 




sT\ 


( 12 ) 


by Lemma B.l we have, for all t > 1, 


/<-l(x<‘>) - < 2n\i L_i(9l‘') - sr. 


(13) 


by Lemma 4.5 we know 


«.(!/'*')-sr'S 2 


g,-i(:A)-sr 


+ inn{at + C(_i), 


(14) 


and by Lemma 2.7 we know that for all t > 1 

^o_pt _ ^opt < (15) 

fj. -r A 

Furthermore, by Corollary B.3, the definition of y^^\ and the facts that = F(x^^'^) and 

ftiz) > F{z) we have 

5o(y^°^) - gT < 2«a (/o(x(°)) - C') < 2AtA (t(x( 0)) - = 2KAeo (16) 

We show that combining these and applying strong induction on t yields the desired result. 

We begin with our base cases. When t = 1 the invariant (11) holds immediately by definition. 
Furthermore, when t = 1 we see that the invariant (10) holds, since a > 2 kx and 


, 2ka 

< -eoi 

a 


(17) 


9o(S<‘') - 9 S-‘ < l(9o(!/‘"') - sT) < ^ (/o(4^'°’) - /o°"‘ 

where we used (12) and (16) respectively. Finally we show that invariant (11) holds for $t = 2$: 

F(x(^)) - < fo{x^^^) - /qP* + /qP* - F°P* (Since F{z) < ft{z) for all t, z 


A 


< 2n'^Kl{go{y^^'^) - g°P*) + ——eo 


/i + A 


< 


4n^f 


a 


+ 


fjj A X 


eo 


< reo 


(Equations (13) and (15)) 

(Equation (17)) 
(Since a > Ann^/ (c'A/(^ + A))) 
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Now consider t > 3 for the second invariant (11). We show this holds assuming the invariants 
hold for all smaller t. 

F(a;(i-1)) _ i^opt < _ ^o_pt ^ ^o_pt _ popt 

< 2r?K\{gt- 2 {yt-i) - 5°-2) -^^‘-2 (Equations (13) and (15)) 

// + A 

2?7^a^^ / \ A 

< -^ {gt- 2 {yt- 2 ) - gtll) H-(Equation (12)) 

a \ / fi -\- A 

Furthermore, 

gt- 2 {yt- 2 ) - gt -2 < ‘^{gt-i{yt- 2 ) - gt%) + 4nK [et _2 + et- 3 ] (Equation (17)) 

< (2r*“^ + 4nK(r*“^ + r*~^)) cq (Inductive hypothesis) 

< 10nKr*“^eo (r’ < 1 and k > 1) 


Since cr > 20n^K|K/(c'A/(/x + A)) combining yields that 


2n?K\ 

a 



< 


/r + A 


eo 


and the result follows by the inductive hypothesis on et- 2 - 

Finally we show that invariant (10) holds for any t > 2 given that it holds for all smaller t and 
invariant (11) holds for that t and all smaller t. 


gt-i{y^^^) 


9^\ 


< 

< 

< 

< 


-{gt-i{yt-i) - gtl\) 

(7 

- 2{g^it-2){yt-i) - glft-2)) + 4nK [et-i + € 4 - 2 ] 

— [2r*“^ + 4nK [r* + eo 


(Definition dual oracle.) 

(Equation (14)) 

(Inductive hypothesis) 
{a > Shk) 


The result then follows by induction. 


□ 


5 Implementation 

In the following two subsections, respectively, we discuss implementation details and report on an 
empirical evaluation of the APPA framework. 

5.1 Practical concerns 

While theoretical convergence rates lay out a broad-view comparison of the algorithms in the 
literature, we briefly remark on some of the finer-grained differences between algorithms, which 
inform their implementation or empirical behavior. To match the terminology used for SVRG in 
Johnson and Zhang (2013), we refer to a “stage” as a single step of APPA, i.e. the time spent 
executing the inner minimization of $f_{x^{(t)},\lambda}$ or $g_{x^{(t)},\lambda}$ (as in (3) and (4)). 

Re-centering overhead of Dual APPA vs. SVRG. At the end of every one of its stages, 
SVRG pauses to compute an exact gradient by a complete pass over the dataset (costing $\Omega(nd)$ 
time, during which $n$ gradients are computed). Although an amortized runtime analysis hides this 
cost, this operation cannot be carried out in step with the iterative updates of the previous stage, 
since the exact gradient is computed at a point that is only selected at the stage’s end. 

Meanwhile, if each stage in Dual APPA is initialized with a valid primal-dual pair for the inner 
problem, Dual APPA can update the current primal point together with every dual coordinate 
update, in time $O(d)$, i.e. with negligible increase in the overhead of the update. When doing so, 
the corresponding data row remains fresh in cache and, unlike SVRG, no additional gradient need 
be computed. 

Moreover, initializing each stage with such a valid primal-dual pair can be done in only $O(d)$ 
time. At the end of a stage where $s$ was the center point, Dual APPA holds a primal-dual pair 
$(x, y)$ where $x = \hat{x}_s(y)$. The next stage is centered at $x$ and the dual variables are initialized at $y$, so 
it remains to set up a corresponding primal point $x' = \hat{x}_x(y) = x - \frac{1}{\lambda}A^\top y$. This can be done by 
computing $x' \leftarrow 2x - s$, since we know that $x - s = -\frac{1}{\lambda}A^\top y$. 
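
The $O(d)$ re-centering identity can be checked numerically. The short Python sketch below uses random data purely for illustration (the variable names and sizes are assumptions, not part of the experiments) and compares the from-scratch mapping with the shortcut $2x - s$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 8, 0.5
A = rng.standard_normal((n, d))
s = rng.standard_normal(d)          # center of the stage that just ended
y = rng.standard_normal(n)          # dual iterate held at the end of the stage

x = s - A.T @ y / lam               # primal iterate x = x_hat_s(y) kept during the stage

x_next_naive = x - A.T @ y / lam    # x_hat_x(y) recomputed from scratch: O(nd)
x_next_cheap = 2 * x - s            # same point via x - s = -(1/lam) A^T y: O(d)

print(np.max(np.abs(x_next_naive - x_next_cheap)))
```

The printed difference is at the level of floating-point round-off, since the two expressions are algebraically identical.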

Decreasing $\lambda$. APPA and Dual APPA enjoy the nice property that, as long as the inner problems 
are solved with enough accuracy, the algorithm does not diverge even for a large choice of $\lambda$. In 
practice this allows us to start with a large $\lambda$ and make faster inner minimizations. If we heuristically 
observe that the function error is not decreasing rapidly enough, we can switch to a smaller $\lambda$. 
Figure 3 (Section 5.2) demonstrates this empirically. This contrasts with algorithm parameters 
such as step size choices in stochastic optimizers (that may still appear in inner minimization). 
Such parameters are typically more sensitive, and can suddenly lead to divergence when taken too 
large, making them less amenable to mid-run parameter tuning. 

Stable update steps. When used as inner minimizers, dual coordinate-wise methods such as 
SDCA typically provide a convenient framework in which to derive parameter updates with 
data-dependent step sizes, and sometimes enable closed-form updates altogether (i.e. optimal solutions to 
each single-coordinate maximization sub-problem). For example, when Dual APPA is used together 
with SDCA to solve a problem of least-squares or ridge regression, the locally optimal SDCA 
updates can be performed efficiently in closed form. This decreases the number of algorithmic 
parameters requiring tuning, improves the overall stability of the end-to-end optimizer and, in 
turn, makes it easier to use out of the box. 
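
To make the closed-form update concrete, here is a minimal Python sketch of Dual APPA with SDCA-style coordinate updates for squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$. It is an illustration under assumptions rather than the implementation used in the experiments: the function names, the uniform sampling with replacement, and the fixed number of epochs per stage are all choices made here. The update is derived by minimizing $g_{s,\lambda}$ exactly along coordinate $i$ while maintaining $x = s - \frac{1}{\lambda}A^\top y$.

```python
import numpy as np


def sdca_stage(A, b, s, x, y, lam, num_updates, rng):
    """Closed-form dual coordinate updates on g_{s,lam}; keeps x = s - A^T y / lam."""
    n = A.shape[0]
    sq_norms = np.einsum("ij,ij->i", A, A)   # squared row norms ||a_i||^2
    for i in rng.integers(0, n, size=num_updates):
        # Exact minimizer of g_{s,lam} along coordinate i (squared loss only).
        delta = (A[i] @ x - b[i] - y[i]) / (1.0 + sq_norms[i] / lam)
        y[i] += delta
        x -= (delta / lam) * A[i]            # O(d) primal update alongside the dual one
    return x, y


def dual_appa_sdca(A, b, lam, num_stages, epochs_per_stage=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = np.zeros(d)                          # initial center x^(0) = 0
    y = A @ s - b                            # y^(0) = y_hat(x^(0)) for squared loss
    x = s - A.T @ y / lam                    # valid primal-dual pair for the first stage
    for _ in range(num_stages):
        x, y = sdca_stage(A, b, s, x, y, lam, epochs_per_stage * n, rng)
        s, x = x, 2 * x - s                  # re-center at x; next primal point in O(d)
    return s


def erm_loss(A, b, x):
    return 0.5 * np.sum((A @ x - b) ** 2)
```

No step size appears anywhere in this sketch; the only tunable quantities are $\lambda$ and the stage length, which is the practical point made in the paragraph above.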

5.2 Empirical analysis 

We experiment with Dual APPA in comparison with SDCA, SVRG, and SGD on several binary 
classification tasks. 

Beyond general benchmarking, the experiments also demonstrate the advantages of the unordinary 
“bias-variance tradeoff” presented by approximate proximal iteration: the vanishing proximal 
term empirically provides advantages of regularization (added strong convexity, lower variance) 
at a bias cost that is less severe than with typical $\ell_2$ regularization. Even if some amount of $\ell_2$ 
shrinkage is desired, Dual APPA can place yet higher weight on its $\ell_2$ term, enjoy improved speed 
and stability, and after a few stages achieve roughly the desired bias. 

Datasets. In this section we show results for three binary classification tasks, derived from 
MNIST,⁷ CIFAR-10,⁸ and Protein:⁹ in MNIST we classify the digits {1, 2, 4, 5, 7} vs. the rest, 
and in CIFAR we classify the animal categories vs. the automotive ones. MNIST and CIFAR are 
taken under non-linear feature transformations that increase the problem scale significantly: we 
normalize the rows by scaling the data matrix by the inverse average $\ell_2$ row norm. We then 
take $n/5$ random Fourier features per the randomized scheme of Rahimi and Recht (2007). This 
yields 12K features for MNIST (60K training examples, 10K test) and 10K for CIFAR (50K 
training examples, 10K test). Meanwhile, Protein is a standard pre-featurized benchmark (75 features, 
~117K training examples, ~30K test) that we preprocess minimally by row normalization and an 
appended affine feature, and whose train/test split we obtain by randomly holding out 20% of the 
original labeled data. 

Algorithms. Each algorithm is parameterized by a scalar value $\lambda$ analogous to the $\lambda$ used in 
proximal iteration: $\lambda$ is the step size for SVRG, parameterizes the decaying step size for SGD, and 
weights the ridge penalty $\frac{\lambda}{2}\|x\|_2^2$ for SDCA. (See Johnson and Zhang (2013) for a comparison of SVRG to 
a more thoroughly tuned SGD under different decay schemes.) We use Dual APPA (Algorithm 3) 
with SDCA as the inner minimizer. For the algorithms with a notion of a stage (i.e. Dual APPA’s 
time spent invoking the inner minimizer, SVRG’s period between computing exact gradients) we 
set the stage size equal to the dataset size for simplicity.¹⁰ SVRG is given an advantage in that we 
choose not to count its gradient computations when it computes the exact gradient between stages. 
All algorithms are initialized at $x = 0$. Each algorithm was run under $\lambda = 10^i$ for $i = -8, -7, \ldots, 8$, 
and plots report the trial that best minimized the original ERM objective. 

Convergence and bias. The proximal term in APPA introduces a vanishing bias for the problem 
(towards the initial point of $x = 0$) that provides a speedup by adding strong convexity to the 
problem. We investigate a natural baseline: for the purpose of minimizing the original ERM problem, 
how does APPA compare to solving one instance of a regularized ERM problem (using a single run 
of its inner optimizer)? In other words, to what extent does re-centering the regularizer over time 
help in solving the un-regularized problem? Intuitively, even if SDCA is run to convergence, some 
of the minimization is of the regularization term rather than the ERM term, hence one cannot 
weigh the regularization too heavily. Meanwhile, APPA can enjoy more ample strong convexity 
by placing a larger weight on its $\ell_2$ term. This advantage is evident for MNIST and CIFAR in 
Figures 1 and 2: recalling that $\lambda$ is the same strong convexity added both by APPA and by SDCA, 
we see that APPA takes $\lambda$ at least an order of magnitude larger than SDCA does, to achieve faster 
and more stable convergence towards an ultimately lower final value. 

Figure 1 also shows dashed lines corresponding to the ERM performance of the least-squares 
fit and of fully-optimized ridge regression, using $\lambda$ as that of the best APPA and SDCA runs. 
These appear in the legend as “ls($\lambda$).” They indicate lower bounds on the ERM value attainable 
by any algorithm that minimizes the corresponding regularized ERM objective. Lastly, test set 
classification accuracy demonstrates the extent to which a shrinkage bias is statistically desirable. 

⁷http://yann.lecun.com/exdb/mnist/ 

⁸http://www.cs.toronto.edu/~kriz/cifar.html 

⁹http://osmot.cs.cornell.edu/kddcup/datasets.html 

¹⁰Such a choice is justified by the observation that doubling the stage size does not have a noticeable effect on the 
results discussed. 

(a) MNIST. Left: excess train loss $F(x) - F^{\mathrm{opt}}$. Right: test error rate. Legend: sdca(0.1), svrg(1.0), appa(1.0), sgd(1.0), ls(0), ls(0.1), ls(1.0). 

(b) CIFAR. Left: excess train loss $F(x) - F^{\mathrm{opt}}$. Right: test error rate. Legend: sdca(0.1), svrg(1.0), appa(1.0), sgd(1.0), ls(0), ls(0.1), ls(1.0). 

(c) Protein. Left: excess train loss $F(x) - F^{\mathrm{opt}}$. Right: test error rate. Legend: sdca(0.1), svrg(0.01), appa(0.1), sgd(1.0), ls(0), ls(0.1), ls(0.1). 

Figure 1: Sub-optimality curves when optimizing under squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$. 

(a) MNIST. Left: train loss $F(x)$. Right: test error rate. Legend: sdca(1e-08), svrg(10.0), appa(0.1), sgd(1000.0). 

(b) CIFAR. Left: train loss $F(x)$. Right: test error rate. Legend: sdca(0.01), svrg(10.0), appa(0.1), sgd(1000.0). 

Figure 2: Objective curves when optimizing under logistic loss. 


In the MNIST and CIFAR holdout, we want only the small bias taken explicitly by SDCA (and 
effectively achieved by APPA). In the Protein holdout, we want no bias at all (again effectively 
achieved by APPA). 

Parameter sensitivity. By solving only regularized ERM inner problems, SDCA and APPA 
enjoy a stable response to poor specification of the biasing parameter $\lambda$. Figure 3 plots the 
algorithms’ final value after 20 stages, against different choices of $\lambda$. Overestimating the step size 
in SGD or SVRG incurs a sharp transition into a regime of divergence. Meanwhile, APPA and 
SDCA always converge, with solution quality degrading more smoothly. APPA then exhibits an 
even better degradation as it overcomes an overaggressive biasing by the 20th stage. 

(a) Squared loss. Left: MNIST. Right: CIFAR. Legend: sdca, svrg, appa, sgd. 

(b) Logistic loss. Left: MNIST. Right: CIFAR. Legend: sdca, svrg, appa, sgd. 

Figure 3: Sensitivity to $\lambda$: the final objective values attained by each algorithm, after 20 stages (or 
the equivalent), with $\lambda$ chosen at different orders of magnitude. SGD and SVRG exhibit a sharp 
threshold past which they easily diverge, whereas SDCA degrades more gracefully, and Dual APPA 
yet more so. 

A Technical lemmas 

In this section we provide several stand-alone technical lemmas we use throughout the paper. First 
we provide Lemma A.1, which collects some common inequalities regarding smooth or strongly convex functions, 
then Lemma A.2, which shows the effect of adding a linear term to a convex function, and then 
Lemma A.3, a small technical lemma regarding convex combinations of quadratic functions. 

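Lemma A.1 (Standard bounds for smooth, strongly convex functions). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a 
differentiable function that obtains its minimal value at $x^{\mathrm{opt}}$. 

If $f$ is $L$-smooth then for all $x \in \mathbb{R}^n$ 

$$\frac{1}{2L}\|\nabla f(x)\|_2^2 \le f(x) - f(x^{\mathrm{opt}}) \le \frac{L}{2}\|x - x^{\mathrm{opt}}\|_2^2.$$

If $f$ is $\mu$-strongly convex then for all $x \in \mathbb{R}^n$ 

$$\frac{\mu}{2}\|x - x^{\mathrm{opt}}\|_2^2 \le f(x) - f(x^{\mathrm{opt}}) \le \frac{1}{2\mu}\|\nabla f(x)\|_2^2.$$

Proof. Apply the definition of smoothness and strong convexity at the points $x$ and $x^{\mathrm{opt}}$ and 
minimize the resulting quadratic form. □ 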

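Lemma A.2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a $\mu$-strongly convex function and for all $a, x \in \mathbb{R}^n$ let 
$f_a(x) = f(x) + a^\top x$. Then 

$$f_a(x) - f_a^{\mathrm{opt}} \le 2\left(f(x) - f^{\mathrm{opt}}\right) + \frac{1}{\mu}\|a\|_2^2.$$

Proof. Let $x^{\mathrm{opt}} = \operatorname{argmin}_x f(x)$. Since $f$ is $\mu$-strongly convex, by Lemma A.1 we have 
$f(x) \ge f(x^{\mathrm{opt}}) + \frac{\mu}{2}\|x - x^{\mathrm{opt}}\|_2^2$ for all $x$. Consequently, for all $x$ 

$$f_a(x) = f(x) + a^\top x \ge f(x^{\mathrm{opt}}) + \frac{\mu}{2}\|x - x^{\mathrm{opt}}\|_2^2 + a^\top x = f_a(x^{\mathrm{opt}}) + a^\top(x - x^{\mathrm{opt}}) + \frac{\mu}{2}\|x - x^{\mathrm{opt}}\|_2^2.$$

Minimizing with respect to $x$ yields that $f_a^{\mathrm{opt}} \ge f_a(x^{\mathrm{opt}}) - \frac{1}{2\mu}\|a\|_2^2$. Consequently, by 
Cauchy-Schwarz and Young’s inequality we have 

$$f_a(x) - f_a^{\mathrm{opt}} \le f(x) - f(x^{\mathrm{opt}}) + a^\top(x - x^{\mathrm{opt}}) + \frac{1}{2\mu}\|a\|_2^2 \qquad (18)$$

$$\le f(x) - f(x^{\mathrm{opt}}) + \frac{1}{2\mu}\|a\|_2^2 + \frac{\mu}{2}\|x - x^{\mathrm{opt}}\|_2^2 + \frac{1}{2\mu}\|a\|_2^2. \qquad (19)$$

Applying Lemma A.1 again yields the result. □ 

¹¹Note we could have also proved this by appealing to the gradient of $f$ and Lemma A.1, however the proof here 
holds even if $f$ is not differentiable. 

Lemma A.3. Suppose that for all $x$ we have 

$$f_1(x) = \psi_1 + \frac{\mu}{2}\|x - v_1\|_2^2 \quad \text{and} \quad f_2(x) = \psi_2 + \frac{\mu}{2}\|x - v_2\|_2^2,$$

then 

$$\alpha f_1(x) + (1 - \alpha) f_2(x) = \psi_\alpha + \frac{\mu}{2}\|x - v_\alpha\|_2^2,$$

where 

$$v_\alpha = \alpha v_1 + (1 - \alpha) v_2 \quad \text{and} \quad \psi_\alpha = \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\mu}{2}\alpha(1 - \alpha)\|v_1 - v_2\|_2^2.$$

Proof. Setting the gradient of $\alpha f_1(x) + (1 - \alpha) f_2(x)$ to 0 we know that $v_\alpha$ must satisfy 

$$\alpha\mu(v_\alpha - v_1) + (1 - \alpha)\mu(v_\alpha - v_2) = 0$$

and thus $v_\alpha = \alpha v_1 + (1 - \alpha) v_2$. Finally, 

$$\psi_\alpha = \alpha\left[\psi_1 + \frac{\mu}{2}\|v_\alpha - v_1\|_2^2\right] + (1 - \alpha)\left[\psi_2 + \frac{\mu}{2}\|v_\alpha - v_2\|_2^2\right]$$

$$= \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\mu}{2}\left[\alpha(1 - \alpha)^2\|v_2 - v_1\|_2^2 + (1 - \alpha)\alpha^2\|v_2 - v_1\|_2^2\right]$$

$$= \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\mu}{2}\alpha(1 - \alpha)\|v_1 - v_2\|_2^2.$$

□ 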
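
As a quick sanity check of Lemma A.3, the following Python sketch evaluates both sides of the identity at a random point. The helper name `combine_quadratics` and the random test values are illustrative assumptions only.

```python
import numpy as np


def combine_quadratics(psi1, v1, psi2, v2, mu, alpha):
    # Returns (psi_alpha, v_alpha) as stated in Lemma A.3.
    v = alpha * v1 + (1 - alpha) * v2
    psi = (alpha * psi1 + (1 - alpha) * psi2
           + 0.5 * mu * alpha * (1 - alpha) * np.sum((v1 - v2) ** 2))
    return psi, v


def quadratic(psi, v, mu, x):
    # psi + (mu/2) ||x - v||^2
    return psi + 0.5 * mu * np.sum((x - v) ** 2)


rng = np.random.default_rng(0)
v1, v2, x = rng.standard_normal((3, 4))
psi1, psi2, mu, alpha = 0.3, -1.2, 2.5, 0.35

psi_a, v_a = combine_quadratics(psi1, v1, psi2, v2, mu, alpha)
lhs = alpha * quadratic(psi1, v1, mu, x) + (1 - alpha) * quadratic(psi2, v2, mu, x)
rhs = quadratic(psi_a, v_a, mu, x)
print(abs(lhs - rhs))
```

The two sides agree up to floating-point round-off for any choice of $x$, $\alpha \in [0, 1]$, and $\mu > 0$.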


B Regularized ERM duality 

In this section we derive the dual (4) to the problem of computing the proximal operator for the ERM 
objective (3) (Section B.1) and prove several bounds on primal and dual errors (Section B.2). 
Throughout this section we assume $F$ is given by the ERM problem (1) and we make extensive use 
of the notation and assumptions in Section 1.1. 

B.1 Dual derivation 

We can rewrite the primal problem, $\min_x f_{s,\lambda}(x)$, as 

$$\min_{x, z} \; \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|_2^2 \quad \text{subject to } z_i = a_i^\top x, \text{ for } i = 1, \ldots, n.$$

By convex duality, this is equivalent to 

$$\min_{x, z} \max_{y} \; \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|_2^2 + y^\top(Ax - z) = \max_{y} \min_{x, z} \; \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|_2^2 + y^\top(Ax - z).$$

Since 

$$\min_{z_i}\{\phi_i(z_i) - y_i z_i\} = -\max_{z_i}\{y_i z_i - \phi_i(z_i)\} = -\phi_i^*(y_i)$$

and 

$$\min_x \left\{\frac{\lambda}{2}\|x - s\|_2^2 + y^\top A x\right\} = y^\top A s + \min_x \left\{\frac{\lambda}{2}\|x - s\|_2^2 + y^\top A(x - s)\right\} = y^\top A s - \frac{1}{2\lambda}\|A^\top y\|_2^2,$$

it follows that the optimization problem is in turn equivalent to 

$$-\min_y \left\{\sum_{i=1}^n \phi_i^*(y_i) + \frac{1}{2\lambda}\|A^\top y\|_2^2 - s^\top A^\top y\right\}.$$

This negated problem is precisely the dual formulation. 

The first problem is a Lagrangian saddle-point problem, where the Lagrangian is defined as 

$$\mathcal{L}(x, y, z) = \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|_2^2 + y^\top(Ax - z).$$

The dual-to-primal mapping (5) and primal-to-dual mapping (6) are implied by the KKT conditions 
under $\mathcal{L}$, and can be derived by solving for $x$, $y$, and $z$ in the system $\nabla\mathcal{L}(x, y, z) = 0$. 

The duality gap in this context is defined as 

$$\mathrm{gap}_{s,\lambda}(x, y) \overset{\mathrm{def}}{=} f_{s,\lambda}(x) + g_{s,\lambda}(y). \qquad (20)$$

Strong duality dictates that $\mathrm{gap}_{s,\lambda}(x, y) \ge 0$ for all $x \in \mathbb{R}^d$, $y \in \mathbb{R}^n$, with equality attained when $x$ 
is primal-optimal and $y$ is dual-optimal. 
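
For squared loss $\phi_i(z) = \frac{1}{2}(z - b_i)^2$, whose conjugate is $\phi_i^*(y_i) = \frac{1}{2}y_i^2 + b_i y_i$, the derivation above can be checked numerically. The Python sketch below is an illustration on random data (function names and problem sizes are assumptions made here): it computes the primal and dual optima in closed form and evaluates the duality gap.

```python
import numpy as np


def primal_obj(A, b, s, lam, x):
    # f_{s,lam}(x) = sum_i (a_i^T x - b_i)^2 / 2 + (lam/2) ||x - s||^2
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum((x - s) ** 2)


def dual_obj(A, b, s, lam, y):
    # g_{s,lam}(y) = sum_i phi_i^*(y_i) + ||A^T y||^2 / (2 lam) - s^T A^T y,
    # with phi_i^*(y_i) = y_i^2 / 2 + b_i y_i for squared loss.
    ATy = A.T @ y
    return 0.5 * y @ y + b @ y + ATy @ ATy / (2 * lam) - s @ ATy


def duality_gap(A, b, s, lam, x, y):
    return primal_obj(A, b, s, lam, x) + dual_obj(A, b, s, lam, y)


rng = np.random.default_rng(0)
n, d, lam = 40, 6, 2.0
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
s = rng.standard_normal(d)

# Closed-form optima obtained by setting the gradients of f and g to zero.
x_opt = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b + lam * s)
y_opt = np.linalg.solve(np.eye(n) + A @ A.T / lam, A @ s - b)

print(duality_gap(A, b, s, lam, x_opt, y_opt))
```

At the optimum the gap vanishes up to round-off, and the two optima are related by $x^{\mathrm{opt}} = s - \frac{1}{\lambda}A^\top y^{\mathrm{opt}}$ and $y^{\mathrm{opt}}_i = \phi_i'(a_i^\top x^{\mathrm{opt}}) = a_i^\top x^{\mathrm{opt}} - b_i$.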


B.2 Error bounds 

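Lemma B.1 (Dual error bounds primal error). For all $s \in \mathbb{R}^d$, $y \in \mathbb{R}^n$, and $\lambda > 0$ we have 

$$f_{s,\lambda}(\hat{x}_s(y)) - f^{\mathrm{opt}}_{s,\lambda} \le 2n^2\kappa_\lambda^2\left(g_{s,\lambda}(y) - g^{\mathrm{opt}}_{s,\lambda}\right).$$

Proof. Because $F$ is $nR^2L$-smooth, $f_{s,\lambda}$ is $(nR^2L + \lambda)$-smooth. Consequently, for all $x \in \mathbb{R}^d$ we have 

$$f_{s,\lambda}(x) - f^{\mathrm{opt}}_{s,\lambda} \le \frac{nR^2L + \lambda}{2}\|x - x^{\mathrm{opt}}_{s,\lambda}\|_2^2.$$

Since we know that $x^{\mathrm{opt}}_{s,\lambda} = s - \frac{1}{\lambda}A^\top y^{\mathrm{opt}}_{s,\lambda}$ and $\|A^\top z\|_2^2 \le nR^2\|z\|_2^2$ for all $z \in \mathbb{R}^n$ we have 

$$f_{s,\lambda}(\hat{x}_s(y)) - f_{s,\lambda}(x^{\mathrm{opt}}_{s,\lambda}) \le \frac{nR^2L + \lambda}{2}\left\|s - \frac{1}{\lambda}A^\top y - \left(s - \frac{1}{\lambda}A^\top y^{\mathrm{opt}}_{s,\lambda}\right)\right\|_2^2 = \frac{nR^2L + \lambda}{2\lambda^2}\left\|A^\top\left(y - y^{\mathrm{opt}}_{s,\lambda}\right)\right\|_2^2 \le \frac{nR^2(nR^2L + \lambda)}{2\lambda^2}\|y - y^{\mathrm{opt}}_{s,\lambda}\|_2^2. \qquad (21)$$

Finally, since each $\phi_i^*$ is $1/L$-strongly convex, $G$ is $1/L$-strongly convex and hence so is $g_{s,\lambda}$. 
Therefore by Lemma A.1 we have 

$$\frac{1}{2L}\|y - y^{\mathrm{opt}}_{s,\lambda}\|_2^2 \le g_{s,\lambda}(y) - g_{s,\lambda}(y^{\mathrm{opt}}_{s,\lambda}). \qquad (22)$$

Substituting (22) in (21) and recalling that $\kappa_\lambda \ge 1$ yields the result. □ 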

Lemma B.2 (Gap for primal-dual pairs). For all s, x G and X > 0 we have 

SaPs,\{x,y{x)) = ^||VF(x)||i + ^||x - s\\l (23) 

Proof. To prove the first identity (23), let y = y{x) for brevity. Recall that 

m = (l^iiajx) G argmax{x’'’ai7/i - <()■ (y*)} (24) 

Vi 


by definition, and hence x'^Oiffi — (t>*{yi) = 4>i{ajx). Observe that 

n 

S&Vs,\{x,y) = + ^iiyi)) - x'^A^yP ApT-||2 + _ ^1 


i=l 


E 

i=l 


( 


(piiajx) + (yi) - x'^aSi -h X\\A^yf -h |||x - s 


V 


=0 (by (24)) 


= + tlk-s| 


^)\\^ + ill® “ ■®l 


2=1 


= ^\\VF{x)f + l\\x-sf. 


□ 


Corollary B.3 (Initial dual error). For all s,x G and X > 0 we have 

9 x,\{y{x)) - < 2 kx (^fx,\{x) - 

Proof. By Lemma B.2 we have 

gaPx,A(2;,y(a;)) = ^l|VF(x)|||-g ^||x-x||^ = ^||VF(x)||| 

Now clearly VT(x) = ^fx,x{x). Furthermore, since fx,x{x) is {nLP? + A)-smooth by Lemma A.l 
we have \\Vfx,xix)\\ < 2{nLR^ + X){fx,x{x) - Z^*)- Consequently, 


5x,A(y(a:)) -yj - gaPa;,A(a;,y(a:)) < 


2{nLR^ + A) 
2A 


[fx,x{x) - fl] 


opt 

A 


Recalling the definition of k,x and the fact that 1 < ka yields the result. 


□ 